Question about search operators and filters

In the search panels (Cmd-Alt-F or Cmd-Shift-F), is there any way to specify search filters within the search field itself? I am looking for the Google way, e.g. “DEVONpad site:devon-technologies.com filetype:pdf”. Typing instead of clicking is much faster. Thanks, Thomas

No.

I suggest you use the full Search window (Tools > Search). Click on the “Advanced” button, where you can add the filter “Kind is PDF” (or PDF+Text for a searchable PDF). The Advanced button provides the features of the smart group editor.

Alternatively, you can add the “Kind” column to the search results view (View > Columns > Kind) and sort for Kind to identify PDF and PDF+Text documents.

Bill, thanks for your reply.
It would be nice to have this request included on the feature suggestion list.

Another question: I do not see an option to search for kind “PDF+Text”. I only find “PDF/PS”. What I have done in the past to identify PDFs that need OCR is to search for PDFs with a word count <100 (some PDFs from scientific journals are image PDFs with the citation included as text, so their word count is not 0). Does this make sense?

Best, Thomas

The case you mentioned, of a PDF that includes some searchable text but then appends other pages that are image-only, will display Kind as PDF+Text.

I generally look for candidates for OCR by adding a Kind column to the view of the All PDFs smart group. (View > Columns > Kind). Sorting the PDFs by Kind then “isolates” those listed as “PDF” as possible candidates for OCR.

But as your example case illustrates, that won’t work for “mixed” PDFs that contain some searchable text pages and appended image-only pages.

Your trick of looking for PDFs that contain a limited number of words is a useful approach. You might experiment with the threshold number of words, as some abstracts might exceed 100 words.
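
To make the threshold idea concrete, here is a minimal Python sketch. It does not talk to DEVONthink at all; the Doc records, the file names, and the ocr_candidates helper are made up for illustration, standing in for the Name, Kind, and Word count columns you would see in the view.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    """Hypothetical stand-in for one row of the view (not a DEVONthink type)."""
    name: str
    kind: str
    word_count: int


def ocr_candidates(docs, threshold=100):
    """Return PDF or PDF+Text documents whose word count is below the threshold."""
    return [
        d for d in docs
        if d.kind in ("PDF", "PDF+Text") and d.word_count < threshold
    ]


# Invented sample data, chosen to mirror the cases discussed above.
docs = [
    Doc("scan-only.pdf", "PDF", 0),
    Doc("journal-with-citation.pdf", "PDF+Text", 85),
    Doc("long-abstract.pdf", "PDF+Text", 140),
    Doc("born-digital.pdf", "PDF+Text", 5200),
    Doc("notes.rtf", "RTF", 40),
]

# Compare two thresholds to see which documents each one flags.
for threshold in (100, 200):
    print(threshold, [d.name for d in ocr_candidates(docs, threshold)])
```

With this sample data, the document with the 140-word abstract is missed at a threshold of 100 and flagged at 200, which is the reason to experiment with the number.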

Note: DT doesn’t allow a dual sort, such as sorting by Kind and then by Word count. But you can emulate that by creating a new group, into which you replicate the items shown as PDF+Text in a Kind sort. Then add the Word count column to the view of that group, and sort those replicated items by Word count. Adding a Size column might help reduce the number of PDF+Text documents that need to be inspected for possible OCR.
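
For anyone who wants to see what the dual sort amounts to, here is a small Python sketch of ordering by Kind first and Word count second. It works on invented (name, kind, word count) tuples rather than on anything exported from DEVONthink, and the names rows and by_kind_then_words are only illustrative.

```python
# Invented sample rows: (name, kind, word_count). Not read from DEVONthink.
rows = [
    ("born-digital.pdf", "PDF+Text", 5200),
    ("scan-only.pdf", "PDF", 0),
    ("long-abstract.pdf", "PDF+Text", 140),
    ("journal-with-citation.pdf", "PDF+Text", 85),
]

# A tuple key sorts by the first element (kind), then by the second
# (word count) among rows that share the same kind.
by_kind_then_words = sorted(rows, key=lambda row: (row[1], row[2]))

for name, kind, word_count in by_kind_then_words:
    print(f"{kind:10} {word_count:6} {name}")
```

The replicate-to-a-new-group workaround gets the same result by hand: the Kind sort picks out the PDF+Text items, and the Word count sort orders them within the new group.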

If your example documents include a citation reference to the journal that’s the source, you could reduce the number of items to be inspected by first doing a Search for that journal name, which “isolates” the items requiring inspection to a much smaller number. You can press the “+” button to the right of the query field in the full Search window (Tools > Search) to replicate the results list to a new group.

Comment: If you request PDFs of items from historical collections such as old newspaper articles, it’s not uncommon to receive a “mixed” PDF+Text that contains a searchable page or two, to which an image of the requested article is appended. Depending on the resolution of the image, OCR accuracy may or may not be acceptable when Data > Convert > to searchable PDF is run. I’ve seen some journals (especially math and physics) that distribute PDFs of such low resolution that OCR is essentially useless.