Search: some PDFs not in search results

Hi,

Lately I have found that in many searches some documents (PDF) do not appear in the results. I am on DT version 2.5.2.

Should match: 2 nearly identical (OCR) PDF files
Type: PDF+Text
Search options: prefix while typing, ignore diacritics, fuzzy. Search: all. Also tried other combinations.
Search: also tested CMD-SHIFT-F

Example: search “2010023”.
Search: Select text “2010023” in PDF, copy, paste in DT search box: no results – should be 2 PDFs
Open PDF from Finder in Preview: search for “2010023”: OK!

Example: search “Totaalservice”.
Search: select “Totaalservice” in PDF, copy, paste in DT search: only 1 result – should be 2 PDFs
Open PDFs from Finder in Preview: search for “Totaalservice”: OK!

The Finder file properties of these 2 files are the same, as is the “Show Info” in DT.
I have not yet found a pattern in which files fail to show up in searches.

Any ideas?

My guess is that you are using the Toolbar query field and have inadvertently used the wrong search settings.

Try the full Search window (Tools > Search). Here all the query settings are visible for inspection (except any in the Advanced button settings).

DEVONthink’s search will work for any alphanumeric string in searchable documents.

Of course, in the case of PDFs, a search for text in the document Content will not work on an image-only document. Similarly, a search will fail if the only occurrence of the text was garbled by an OCR error.

First of all: thank you for the quick response!

  • The PDFs are all PDF+Text (OCR by both DT and ScanSnap software)
  • All searches were initiated by selecting the text in the PDF, copying it, and then pasting it into the search (both Tools > Search and the toolbar search field).

The thing that puzzles me is the inconsistent behavior.

  • doc1: copy text, search: only doc2 in the results
  • doc2: copy text (same), search: only doc2 in the results

I also pasted the copied text into a text editor to see what it looked like (identical in both cases).
I don’t know whether there is a way to completely rebuild the DB and the search indexes.

DJ

I never, ever use the little query search field in the Toolbar. I always use the full Search window (Tools > Search), because it is much easier to examine for configuration, allows search features not available in the little query field, and provides still more power, including quick generation of smart groups. I could go on to list still more reasons why I always prefer the Search window…

If you have two PDFs, each of which contains a text string that can be found using the Find routine, DEVONthink searches will also always work. If, as you reported, one wasn’t found using a DEVONthink search, my guess is that the search wasn’t properly configured. For example, your query field search may have been limited to a specific group or database, so that the second item was never searched.

Update: a PDF+Text document created by ScanSnap OCR cannot be found. After running the DEVONthink OCR on the same document, it can be found.
Most of the PDFs OCR’d by ScanSnap work fine.

– TEST: take PDF+text that cannot be found, have DT re-OCR it and then test again
search argument: “113641628”
search settings DT Search bar: ALL, ignore diacritics, fuzzy, prefix
search settings DT CMD-SHIFT-F: ALL, ignore diacritics, fuzzy, any, any, any, correct-database,
search settings DT document open: ignore case, entire document

 2011_11_12_17_27_34.pdf copy (Scansnap OCR) - DT search NOT FOUND (both search-bar & CMD-SHIFT-F)
      OCR text copy paste: soort belasting Belastingen belastingjaar 2011 aanslagnummer 113641628
                           dagtekening aanslagbiljet 28-02-2011 laatste vervaltermijn 30-04-2011
      open PDF in Preview: search for 113641628: OK
      open PDF in DT: search for 113641628: OK

 2011_11_12_17_27_34.pdf copy DTocr (DT OCR) - DT search FOUND (both search-bar & CMD-SHIFT-F)
      OCR text copy paste: soort belasting belastingjaar aanslagnummer dagtekening aanslagbiljet laatste vervaltermijn
                           subjectnummer BSN nummer
                           Belastingen 2011 113641628 28-02-2011 30-04-2011
      open PDF in Preview: search for 113641628: OK
      open PDF in DT: search for 113641628: OK
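
One more check that might be worth trying, as a rough sketch only: pasted text can look identical in a text editor and still contain characters that are not visible, such as a zero-width space or a no-break space. Whether that is what happens in the ScanSnap text layer here is purely an assumption, and I do not know how DT's indexer would treat such characters. The Python snippet below uses only the standard library; `hidden_chars` is a made-up helper name, not part of DT or ScanSnap. It lists any control, format, or non-ASCII space characters in a pasted string:

```python
import unicodedata

def hidden_chars(text):
    """Return (index, code point, Unicode name) for each character that is a
    control (Cc), format (Cf) or space-separator (Zs) character, skipping the
    ordinary ASCII space."""
    found = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in ("Cc", "Cf", "Zs") and ch != " ":
            name = unicodedata.name(ch, "UNNAMED")  # control chars have no name
            found.append((index, f"U+{ord(ch):04X}", name))
    return found

# Plain digits: nothing hidden.
print(hidden_chars("113641628"))          # → []

# Hypothetical example with a zero-width space inside the number.
print(hidden_chars("1136\u200b41628"))    # → [(4, 'U+200B', 'ZERO WIDTH SPACE')]
```

Pasting the text copied from the ScanSnap version and from the DT OCR version into it, and comparing the two outputs, would show whether the two strings really are identical character for character.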