Inaccurate searches - help please

Mirithiar · July 5, 2016, 9:58am

Hello,

I’m using DT Pro Office. Everything is PDF + text.

The main reasons I went for DT and the Pro Office version were the OCR, smart searches and the AI. I was hoping that DT would be a great way to manage, search and explore my pdf library (I’m a PhD researcher).

However, it seems like the searches are really inaccurate and in many cases DT is simply unable to locate the pdfs I’m after. Even when I copy text directly from a pdf and then use it to search for that pdf, it doesn’t come up - surely it should be a perfect match and right there at the top of the list? I discovered this when I tried searching for a pdf I knew I had in there and had to resort to looking for it manually… It defeats the point of keeping pdfs in DT completely.

My question is, do you have any idea why this is happening and is there anything I can do? I have checked that those pdfs are OCRed and I can’t think of anything else. I imagine that if it can’t even find single pdfs then the smart groups or AI will miss those too.

Maybe it’s only certain kinds of pdfs? Maybe the OCR can’t deal with something specific and so misses out on things even though the documents are shown as searchable? If so, is there any way to locate such documents?

korm · July 5, 2016, 10:11am

The quality and accuracy of OCR (and therefore search) does depend on the source – e.g., a PDF scanned from a copy of a typed manuscript might not have 100% fidelity when OCRd.

But almost anything anyone tells you here is going to be speculation and not helpful to you, because we don’t have what you have. It would be better if you opened a ticket with Support and sent them copies of the PDFs that are having trouble being searched along with a description of what you expected and what you got.

cgrunenberg · July 5, 2016, 10:12am

It should be a perfect match only if the text doesn’t contain search operators, or at least if the text is quoted.

Greg_Jones · July 5, 2016, 10:14am

It would be helpful to let support know (and/or share here) precisely how you are searching for documents. Full search, using the Search Window (Edit menu>Find>In Database…) or using the more restrictive search box in the Toolbar? Also, what search parameters have you selected, e.g., Toolbar search>All, Content, Name, URL,…? You may have set search parameters/locations that by design have excluded the location and/or content of the documents that you are wanting to find.

Mirithiar · July 5, 2016, 10:30am

Thanks for the speedy replies!

Greg - Search type

Tried both, using the toolbar and through the proper search window (Tools->Search). Toolbar (or Edit menu>Find>In Database…) doesn’t find the paper at all. You are right that there seems to be a difference: Tools->Search finds it if “All” or “Content” is selected, although the paper is still quite low on the list, nowhere near the top. That is an issue when papers are not properly named (I had hoped I wouldn’t have to rename everything but could just use DT searches).

cgrunenberg - the search was “Females increase offspring heterozygosity” (with the quotes).

korm - will do, I was just wondering whether anyone might have a similar experience. It’s hard to know what’s not being found, when it’s not being found… So hard to know what % of the database is affected.

The pdf that made me realise something is wrong is from Letters to Nature and published in 2003, so I would have expected it to be good enough to get OCRed properly. It’s not a photocopy, scanned version of an old typewriter paper or anything handwritten. It seems like depending on which bit of text I pick from it, it’s either found easily or has trouble - but again, hard to know how much of it is affected/how much of an issue it is.

EDIT: Thank you for the ideas. I can resort to using the full search window for now, in the hope that it will locate things a bit more easily, and once I gather a few troublesome pdfs I will open a ticket.

korm · July 5, 2016, 10:49am

A quick test of the accuracy of the text layer in a PDF is to select the document and use Convert > to Plain Text from the contextual menu. This will extract the text layer into a plain text file that you can examine for quality and fidelity. Obviously you’re not going to check all the documents in a database - but a sample might give you an idea of what’s going on. You can also look at the Concordance window for the whole database, or the Concordance for a single document (in the document navigation bar). In either case, double-clicking a word in the Concordance display should open a Search window using that word as the search predicate. This is another spot test – but it is also a useful tool.
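For anyone who wants to run the same kind of spot test on an exported plain text file outside DT, here is a minimal Python sketch of a concordance (a word-frequency list). It is not how DT builds its Concordance; the function names are hypothetical, and "Fernales" is an invented example of an OCR misread. Words that occur only once are often worth a look, since OCR errors tend to show up as one-off spellings.

```python
import re
from collections import Counter

def concordance(text):
    """Count how often each word occurs in a plain-text export."""
    # Lowercase first so "Females" and "females" are counted together.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def rare_words(text, max_count=1):
    """Return words seen at most max_count times, sorted alphabetically."""
    counts = concordance(text)
    return sorted(w for w, n in counts.items() if n <= max_count)

# Invented sample: the second sentence contains a made-up OCR misread.
sample = (
    "Females increase offspring heterozygosity. "
    "Fernales increase offspring heterozygosity."
)

if __name__ == "__main__":
    print(concordance(sample).most_common(3))
    print(rare_words(sample))
```

A rare word is not necessarily an error (technical terms are often rare too), so this only narrows down what to check by eye.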

Frederiko · July 5, 2016, 11:50am

OCR is by its nature sophisticated guesswork but there is a way to improve the accuracy of your searches.

In the advanced search you should make frequent use of the similar words sidebar.

If I search for the word “Trillion” in one of my databases the similar words sidebar shows that the database also contains the words Trillian, Billion, Criplion, Dillion, Icillion, Illion, Inillion and Inmillion. These are obvious OCR errors and I would like them to be included in my search.

[Screenshot: Similar words.jpg]

In other words, DT will now be searching for all the variations of the word “Trillion” in the database(s).
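To illustrate the general idea of matching near-miss spellings (not DT's actual algorithm, which I don't know), here is a small Python sketch using the standard library's difflib. The function name, the cutoff value and the short vocabulary list are assumptions for illustration; the words are taken from the example above.

```python
import difflib

def similar_words(query, vocabulary, cutoff=0.7, limit=10):
    """Return vocabulary words whose spelling is close to the query.

    cutoff is a similarity ratio between 0 and 1; higher means stricter.
    """
    # Compare in lowercase and drop duplicates before matching.
    lowered = sorted({w.lower() for w in vocabulary})
    return difflib.get_close_matches(query.lower(), lowered, n=limit, cutoff=cutoff)

# A few of the words from the sidebar example, plus one unrelated word.
vocabulary = ["Trillion", "Trillian", "Billion", "Dillion", "Illion", "Heterozygosity"]

if __name__ == "__main__":
    for word in similar_words("Trillion", vocabulary):
        print(word)
```

Lowering the cutoff catches more OCR variants but also pulls in genuinely different words ("Billion" is a real word, not a misread), so the results still need a human eye, just as with the sidebar.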

I find this facility almost indispensable for working with OCR text.

I think you will be amazed how many variations of “heterozygosity” there are in your database!

Frederiko

Mirithiar · July 5, 2016, 4:23pm

korm and Frederiko - thank you very much for the tips, I shall give them a go! :]

So many things to figure out with DT and I don’t want to bug the forum with each little one.

StewartW · July 8, 2016, 10:05am

Just a thought…have you opened up the PDF in Preview and performed the text search there to see if it is DT or the PDF itself that is the issue?
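If you export the text layer to plain text, the same kind of check can be scripted. The sketch below is only an assumption about why an exact phrase might fail to match: text layers can contain line breaks in the middle of a phrase, words hyphenated across lines, and ligature characters. The function names and the sample text are hypothetical, and this says nothing about how DT or Preview search internally.

```python
import re
import unicodedata

def normalize(text):
    """Flatten a text layer so that phrase matching ignores layout."""
    # NFKC folds compatibility characters, e.g. the ligature "ﬁ" becomes "fi".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"-\s*\n\s*", "", text)
    # Collapse any run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

def contains_phrase(text_layer, phrase):
    """True if the phrase occurs in the text layer after normalization."""
    return normalize(phrase) in normalize(text_layer)

# Invented layout: the phrase is split over lines and hyphenated.
layer = "Females increase\noffspring hetero-\nzygosity"
phrase = "Females increase offspring heterozygosity"

if __name__ == "__main__":
    print(phrase in layer)                 # raw substring test
    print(contains_phrase(layer, phrase))  # test after normalization
```

One caveat: rejoining at every end-of-line hyphen will also merge genuine hyphenated compounds that happen to break across lines, so treat a match as a hint rather than proof.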