Problem with searching

rolfgieb · August 6, 2006, 6:28am

Certain words/phrases that I know are contained in documents in my database are not being found when I do a search of the database (Tools > Search). I have tried changing the search options, but to no avail. If I open the document in question and do a simple search (Cmd + F), the said word/phrase will in some cases be found and in other cases not. Any ideas why this might be happening?

On a slightly related subject, when I click the Words button near the bottom of an open document to check word frequency, I have noticed that, for documents imported using the “Save to DEVONthink Pro” script in the Print dialogue, the words with a frequency of 1 include many that in fact consist of two, three or more words run together. Is this a bug or a peculiarity of my own setup?

Any help much appreciated,

Rolf

cgrunenberg · August 6, 2006, 7:33am

No, that’s not a bug. The PDFKit of Mac OS X 10.4 (used by Preview or DEVONthink for example) is not always able to convert PDF documents successfully to text. As a result, words are sometimes concatenated (among other problems), and searching fails.

An alternative might be to use the pdftotext utility instead of the PDFKit (see preferences). Then export one problematic PDF document and import it again using pdftotext - does it make a difference?
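A quick check outside DEVONthink can show whether a given word survives text extraction at all. The following is only a minimal sketch: it assumes a command-line pdftotext (as shipped with Xpdf or Poppler) is on the PATH, which may not be the same build DEVONthink uses internally, and the helper names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build the command that writes the PDF's text to standard output."""
    # A trailing "-" tells pdftotext to print to stdout instead of a file.
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the extracted text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def contains_word(text, word):
    """Case-insensitive whole-word test on whitespace-separated tokens."""
    tokens = (t.strip(".,;:!?()[]\"'") for t in text.lower().split())
    return word.lower() in tokens
```

If `contains_word(extract_text("problem.pdf"), "someword")` is false for a word that is visibly on the page, the extraction step has probably concatenated or split it, which would explain the failed database search.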

Bill_DeVille · August 6, 2006, 7:40am

Rolf, I’ll make a bet with you. If a document contains a word but Tools > Search can’t find it, it’s because the search options are limited so as to exclude the document. Remove the unintended limitation and the document will be found.

Make sure that

Search for: is All
Operator: is All Words or Any Word (or Phrase)
Comparison: is ignore case
State: is Any
Locking: is Any
Label: is Any
Search in: is Database

Why do run-together words show up in the Words listing for documents? That’s an artifact of PDF creation, so is a kind of bug – but not done by DT Pro. I’ve observed this in PDFs produced by Acrobat, both Mac and Windows versions, as well as in PDFs produced by the “print to PDF” feature of Mac OS X.

rolfgieb · August 6, 2006, 9:03pm

Are you sure you want to go ahead with that bet, Bill? I was doing exactly what you suggest, but was still not getting the desired results. I then thought of reimporting the “offending” document as RTF, and the desired words were immediately found. Would the fact that they were not found when the document was imported as a PDF be related to the shortcomings of the PDFKit mentioned by Christian? And does this mean that to ensure that one gets the desired results when doing a search one should import documents as RTF rather than PDF?

Rolf

Bill_DeVille · August 6, 2006, 9:16pm

Hi, Rolf.

Will you let me off the hook?

I do almost all my Web captures as rich text capture of selected text and images, e.g., Science Magazine online, New York Times, Nature, etc.

I prefer that to the “print to PDF” or alternative HTML or WebArchive captures because the rich text capture has working hyperlinks and I can avoid advertisements and other unwanted material from a page. Of course, I do use the print to PDF “Save to DEVONthink Pro.scpt” routine when I’m saving banking records of checks, software registrations, etc. There are enough search cues that I never have problems finding those records, and the PDF image is important in many cases.

You can check that problematic PDF by selecting it, then choosing Data > Convert - plain text. Perhaps you had hit the run-together words problem in that one.

Note: When I’m doing internal searching in large PDF documents I do that in Preview, which has no problem finding runtogether words.

rickl · August 23, 2006, 2:29am

Choosing Tools > Concordance shows me that I’ve had this runtogether problem countless times. I took one of the culprits and, sure enough, using pdftotext seems to solve the problem (though there appear to be a couple of examples of words being split up inappropriately). Would you recommend that I leave pdftotext selected in Preferences without further thought, or are there any advantages to PDFKit that I ought to be aware of?

Incidentally, what is the default, and why was it chosen as the default over the alternative?

badger · August 28, 2006, 3:35pm

I, too, would like to hear an answer (opinion) to rickl’s questions regarding PDFKit vs pdftotext and why PDFKit is default, given the goal of search accuracy and therefore document ranking in search results.

Bill_DeVille · August 28, 2006, 8:20pm

Christian is taking a well-deserved vacation, or I would punt to him on that question.

Run-together text is reasonably uncommon in my PDF files, so I haven’t worried very much about search accuracy, especially as Command-F or Preview’s search can still handle the run-togethers.

Many of the cases of run-together text or broken strings already existed in the PDFs before import into DT Pro. I’ve also seen such text in the Windows environment.

So (a) although I’ve experimented a bit, I don’t have any statistics comparing PDFKit to pdftotext; (b) I see the phenomenon in some PDFs that I’ve downloaded, where it already existed before import; and (c) most of my PDFs are large enough that DT Pro searches will find them anyway. For the moment I’m treating this as a minor aggravation.

I’ve OCR’d several thousand pages. Last night I scanned/OCR’d a number of pages into a new DT Pro database, so I’ve got some statistics on the occurrence of run-togethers in this database (I browsed the Concordance):

54 run-together words (probably more; I didn’t look at strings of less than 6 characters, and did quick inspections)
10,976 unique words
57,647 total words

The percentage of run-togethers is pretty small. All but a couple of them had a frequency of 1 (“andthe” had a frequency of 2).
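For reference, the arithmetic behind "pretty small" can be worked out from the three counts above. This sketch treats each of the 54 run-togethers as a single occurrence, which slightly understates the share of total words (at least one of them occurred twice).

```python
# Counts quoted from the Concordance figures above.
run_together = 54
unique_words = 10_976
total_words = 57_647

# Share of the distinct word list that is a run-together string.
share_of_unique = 100 * run_together / unique_words

# Share of all word occurrences, assuming one occurrence per run-together.
share_of_total = 100 * run_together / total_words

print(f"{share_of_unique:.2f}% of unique words")   # about 0.49%
print(f"{share_of_total:.3f}% of total words")     # about 0.094%
```

So roughly one in 200 distinct strings in the word list is a glitch, but fewer than one in 1,000 word occurrences is affected.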

Did they result from OCR recognition errors, errors in saving as PDF+text in the OCR plugin to DT Pro, or errors in capture by DT Pro? I don’t know.

As a practical matter, although I wish there were no errors at all, I’m delighted by the quality of searchable text in this database. I’ve had to evaluate lots of experimental data resulting, e.g., from chemical analysis.

From that perspective, the frequency of errors in this data set would be acceptable. In real-world sampling and analysis events there are always going to be some “errors” resulting from variables – sampling errors, choice of analytical methodology, instrumentation variables, etc.

If I were to seek better accuracy I could look up (from the Concordance view) the documents that have run-togethers and correct them. For the purposes of this database I’m not going to bother. Text editing can be done in PDF, but it’s a major hassle. If important, I’ll correct the error by placing the corrected string in the Comment field.

Bottom line: the universe isn’t fair. Always expect that PDFs you download from others may have some text glitches, and that those you create may have them. So a DT Pro Phrase or Spelling or Concordance approach may find that document that has a hidden glitch. To find it using regular search approaches, a workaround would be to “unpack/retype” the run-together (or broken) string in the Comments field.
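Browsing the Concordance by eye for run-togethers could, in principle, be automated on an exported word-frequency list. The following is a rough sketch, not anything DT Pro provides: it assumes the frequencies are available as a dictionary and that a vocabulary of known words is on hand, and the function names and thresholds (strings of at least 6 characters, frequency of at most 2) are illustrative choices based on the figures reported earlier in this thread.

```python
def split_into_known_words(token, vocabulary, min_len=2):
    """Return a list of known words that concatenate to token, or None."""
    token = token.lower()
    # best[i] holds one segmentation of token[:i], or None if none exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end - min_len, -1, -1):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(token)]


def run_together_candidates(frequencies, vocabulary, min_len=6):
    """Flag rare, long tokens that are not words but split into known words."""
    flagged = {}
    for token, count in frequencies.items():
        if count > 2 or len(token) < min_len or token.lower() in vocabulary:
            continue
        parts = split_into_known_words(token, vocabulary)
        if parts and len(parts) >= 2:
            flagged[token] = parts
    return flagged


vocabulary = {"and", "the", "search", "document"}
frequencies = {"andthe": 2, "document": 40, "searchdocument": 1, "xyzzyq": 1}
print(run_together_candidates(frequencies, vocabulary))
```

Tokens that are themselves in the vocabulary are skipped, so genuine compounds are only as safe as the word list is complete; the output is a list of candidates to inspect, not a verdict.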

Fortunately, there’s usually enough redundancy of text content in documents that the occasional glitch won’t keep that document from being found.