Weird Search Result -- Query

Hi,

I’m experiencing something strange with my Search results in DT. Here’s what’s happening:

  1. I have a database of 1,000 PDFs, all text-searchable.
  2. If I open up PDF1 and search inside it for “Magdalene” then the term in question is highlighted. No problems there.
  3. However, if I do a full “All” or “Content” search of all 1,000 PDFs using the broad Search service (Shift+Command+F) then two other PDFs are found, both of which contain the term, but NOT PDF1. It’s as if DT doesn’t see PDF1 anymore.

Does anyone have any idea why this might be happening? (I put on a Fuzzy setting to see what would happen but a few alternate spellings popped up and nothing from PDF1.) I thought maybe it was because the term appeared only once in PDF1 and so DT assumed it was valueless, but I’ve checked the other two PDFs that popped up and the term is also just used once. I also created a new folder of just 50 PDFs, including PDF1, to see whether the size of the database was causing problems, but the same thing happened.

It’s a bit worrying and I wonder what else is not being found during searches. Isn’t DT being sold as a marvel of AI and searchability?

Hope you can help.

Often a puzzling search issue like this one can be resolved by taking a good look at the settings on the Search page. Perhaps they are set to exclude PDF1 in this case, either through the location searched (e.g., PDF1 is in the Global Inbox, but the search is in a different database) or through Advanced settings from an earlier search that haven’t been reset.

I always use the full Search window precisely because it’s much easier to examine the settings than in the little search field in the Toolbar. Of course, the full Search window also has a number of features not available in the Toolbar search field.

If “Magdalene” is in the body of the searchable text of PDF1 and that term is used as the query term in an All search for Databases, PDF1 should be listed in the search results. (I’m assuming that other settings are null, e.g., that Labels are not set to a specific case such as Red.)

Edit > Find > Find… is a substring search but by default the search window is looking for whole words. Does ~ Magdalene find the PDF?
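To illustrate the difference between the two kinds of search, here is a minimal Python sketch. It is not how DT works internally; the function names and the sample strings are made up for the example, and the "glued" string simply assumes that the text layer of a PDF lost the space before the word.

```python
import re

def substring_match(text: str, term: str) -> bool:
    """Find-in-document style: the term may sit inside a longer string."""
    return term.lower() in text.lower()

def whole_word_match(text: str, term: str) -> bool:
    """Whole-word style: the term must be bounded by non-word characters."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical text layers: one clean, one where two words were run together.
clean = "the cult of Mary Magdalene in Provence"
glued = "the cult of MaryMagdalene in Provence"

for label, text in (("clean", clean), ("glued", glued)):
    print(label,
          "substring:", substring_match(text, "Magdalene"),
          "whole word:", whole_word_match(text, "Magdalene"))
```

With the glued text, the substring check still succeeds while the whole-word check fails, which would match the symptom of a term that can be found inside the file but not in the main Search window.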

Hi,

Thanks for your helpful (and rapid) replies. I’ve made sure all the settings were set to All.

I’ve done a bit more exploring and found that if I OCR the article in question using DT’s inbuilt reader, Search finds “Magdalene” without a problem. (Most of my PDFs are from JSTOR and so are automatically searchable, but this one was a paper copy.)

Before, I had OCR’d it using Acrobat and, though I could find Magdalene in the individual file as I said earlier, DT wasn’t listing it in the main Search window.

Now, of course, it seems as if Acrobat is at fault here, but we’d still have to explain why: (a) this has not, to my knowledge, happened before, even though I’ve OCR’d about 100 PDFs using Acrobat; and (b) I can still find Magdalene in the file, so clearly the term has been accurately OCR’d.

Do you have any idea of why that could be? I can rescan and re-OCR the article using Acrobat to see if I can replicate the problem. Maybe it was some weird one-off bug.

(By the way, the reason I’m using Acrobat rather than DT’s OCR, and I think I once had a discussion with Bill D. about this, is that I was finding that DT, even on medium settings, was making the PDFs massive, i.e., ballooning from 12MB to 48MB, whereas Acrobat kept it to about 16MB on regular settings. Admittedly, that was an extreme example, as most of my PDFs are about 1 to 4 MB, but it was sufficient to make me a little wary of OCRing via DT. Maybe I was wrong or just got a freak example.)

Converting PDF documents to text for indexing is always a tricky issue. Depending on the creator and the converter, words can be concatenated, split, etc.
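As a rough illustration of why that matters for a database-wide search, here is a small Python sketch of a word index built from extracted text. Everything in it is an assumption for the example: the document names, the sample text, and the very simple tokenizer. It is not DT's actual indexing code.

```python
import re
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word token to the set of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(name)
    return index

# Hypothetical extracted text layers from three PDFs.
docs = {
    "PDF1": "the cult of MaryMagdalene in Provence",  # words concatenated
    "PDF2": "relics of Mary Magdalene at Vezelay",    # extracted cleanly
    "PDF3": "sermons on Mag dalene and penitence",    # word split in two
}

index = build_index(docs)

# A whole-word lookup only sees documents whose tokens match exactly.
hits = sorted(index.get("magdalene", set()))
print(hits)
```

In this sketch only PDF2 is returned for "magdalene". PDF1 is indexed under the single token "marymagdalene" and PDF3 under "mag" and "dalene", so neither turns up in a whole-word lookup even though a reader would see the name on the page.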