Filter searches on parts of documents or on document formats

Hi

A lot of searches I do return thousands of results. On a search engine such as Google I’ll typically narrow the search by specifying that one of my search terms must be in the title of the web page, in the URL, etc.

Would it be possible to add this sort of feature to DA, so that you could specify that search term(s) must be in a heading (e.g. an <H1>, <H2>, <H3>… HTML tag) or in the tag or in the URL itself? Using the heading tag would also help in blog searches, etc.
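To make the request concrete, here is a minimal Python sketch of what such a filter could check for a single result page. It is not how DA works internally; the class `TitleHeadingCollector` and the function `term_locations` are hypothetical names, and the sketch assumes the page's HTML and URL are already available as strings.

```python
from html.parser import HTMLParser


class TitleHeadingCollector(HTMLParser):
    """Collect the text found inside <title> and <h1>..<h6> elements."""

    WANTED = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._open = []  # stack of wanted tags that are currently open
        self.text = {"title": [], "heading": []}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag in self.WANTED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            kind = "title" if self._open[-1] == "title" else "heading"
            self.text[kind].append(data)


def term_locations(term, url, html):
    """Return the subset of {"url", "title", "heading"} that contains term."""
    parser = TitleHeadingCollector()
    parser.feed(html)
    parser.close()

    needle = term.lower()
    found = set()
    if needle in url.lower():
        found.add("url")
    for kind, chunks in parser.text.items():
        if needle in " ".join(chunks).lower():
            found.add(kind)
    return found
```

A result could then be kept only when, say, `"title" in term_locations(term, url, html)`, or ranked higher the more locations match. The match is a plain case-insensitive substring test, so it ignores word boundaries and Boolean operators.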

I also often limit searches by specifying the document format (e.g. PDF, DOC, etc. as in the LinkedDocument scanner). Previous posts have suggested that it would be possible to search within PDFs, etc. in release 2.00 of DA. However, this still does not seem to work (or doesn’t for me). When will PDFs (and other file formats such as Word, PowerPoint, etc.) become searchable and listed in the DA outputs? And when they do, would it be possible to limit outputs to specified formats? Many of the searches I do give the majority of results as PDF files, so it would help if I could restrict output to just these rather than the holding page that links to them!
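As an illustration of limiting output to specified formats, the following Python sketch filters a list of result URLs by file extension. The function name `filter_by_format` is hypothetical and this is not a DA feature; it also assumes the format can be read from the URL path, which is only a hint (a server can deliver a PDF from a URL with no extension).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filter_by_format(urls, formats):
    """Keep only the URLs whose path ends in one of the given extensions."""
    wanted = {f.lower().lstrip(".") for f in formats}
    kept = []
    for url in urls:
        # Use the path component only, so query strings do not hide the extension
        suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
        if suffix in wanted:
            kept.append(url)
    return kept
```

For example, `filter_by_format(results, ["pdf", "doc", "ppt"])` would keep the linked documents themselves and drop the HTML holding pages.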

Unfortunately until these changes are implemented my use of DA has to be limited. I love the advanced Boolean features and similar, but even with these I get too many false results. Just adding the file format and anchor/heading/title filters would mean I could use DA almost all the time for my searching.

You can already limit the search to specific document formats by using a scanner (if you want to look for PDF/Word etc., use the “Linked Documents” scanner). Click on the “Settings” button in the toolbar to show these options in a drawer.

DA 2 does not search into PDFs. This is something that we’re looking into. The local searching done by DA is on the content text of HTML pages only, independent of the formatting. For some sites where the information is always presented in a defined format, a search plugin can get the title, date, etc. from a result page. But for a general HTML page this is impossible for a plugin to do. However, I haven’t tried using Google-specific keywords as a phrase (i.e. in between “”). Of course, this would be very search-engine specific, but you can give it a try.

Also, don’t forget the “Secondary Query” option in the “Settings” drawer. It lets you search the results locally, independent of the search engine, so it is not limited to the capabilities of the search engine.

Hi

I was worried that this would be the reply.

My problem is that a LOT of what I’m looking for resides in non-HTML format documents such as PDFs but also PowerPoint and Word documents.

This means that there are many documents I need that DA will not / cannot find because they are not read. Selecting the LinkedDocument scanner doesn’t help if the document itself was not found because the text was in a PDF file.

Similarly, searching in the Title HTML tag can limit what I’m looking for considerably. I get much better searches this way, typically getting a search of say 100,000 items down to a couple of hundred, of which around 20-30 will be relevant. I’d love to be able to do such searches in DA. Post-processing with the secondary search field doesn’t help, as what I want is the same expression but ranked according to relevancy, with the key factor being that the search term is part of the title.

Benaron:

I know this suggestion doesn’t completely satisfy what you are looking for. But I sometimes do very large collections of search results in DEVONagent, and then move them over to a new DT Pro database.

Several months ago I collected more than 10,000 results from DA searches on a topic and moved them to DT Pro for subsequent filtering and organization.

Although DT Pro currently lacks some of the query power of DA, one can pull some rather fancy logical tricks. Here’s an example:

[1] Search the DT Pro database for termx OR termy.

[2] Select and replicate the search results to a new group.

[3] Search the new group for termx and move the search results to a new group.

Result: The group left behind is now equivalent to a termy NOT termx search, and the moved group holds everything that matches termx. Searching that group for termy and moving the hits out once more leaves a group equivalent to a termx NOT termy search, plus a separate group of items that contain both terms. This separation results from multilevel searches.
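The multilevel search above can be sketched with plain Python dictionaries and sets. This only models the logic; it does not use any DT Pro API, and `search` and `split_by_terms` are hypothetical names. One assumption goes beyond the numbered steps: after step [3] the moved group still contains the items that match both terms, so the sketch splits it once more on termy to get the "both terms" group.

```python
def search(group, term):
    """Return the ids of documents in group whose text contains term."""
    needle = term.lower()
    return {doc_id for doc_id, text in group.items() if needle in text.lower()}


def split_by_terms(docs, termx, termy):
    """Split docs into (termx NOT termy, termy NOT termx, both terms)."""
    # [1] Search the whole database for termx OR termy
    hits = search(docs, termx) | search(docs, termy)

    # [2] Replicate the search results to a new group
    group_a = {doc_id: docs[doc_id] for doc_id in hits}

    # [3] Search the new group for termx and move the results to another group
    x_ids = search(group_a, termx)
    group_b = {doc_id: group_a.pop(doc_id) for doc_id in x_ids}
    # group_a is now "termy NOT termx"; group_b holds everything with termx

    # Assumed extra step: move the items that also contain termy out of group_b
    both_ids = search(group_b, termy)
    group_both = {doc_id: group_b.pop(doc_id) for doc_id in both_ids}

    return group_b, group_a, group_both
```

Each step only ever searches the group produced by the previous one, which is why a tool with simple queries can still reproduce a NOT operator.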

And of course DT Pro can ‘read’ the text content of your PDF and Word files. Don’t forget the AI features in DT Pro such as See Also and See Selected Text. Or the ability to check terms used in a document and click on one to initiate another search. Or select a phrase and do a Lookup search. Or select a term and Option-click it.

A future 2.x release will probably support PDF, RTF and Word documents, as well as matching on keywords, title and/or URL.