Christian is taking a well-deserved vacation, or I would punt to him on that question.
Run-together text is reasonably uncommon in my PDF files, so I haven’t worried very much about search accuracy, especially as Command-F or Preview’s search can still handle the run-togethers.
Many of the cases of run-together text or broken strings already existed in the PDFs before import into DT Pro. I’ve also seen such text in the Windows environment.
So (a) although I’ve experimented a bit, I don’t have any statistics comparing PDFKit to pdftotext; (b) I see the phenomenon in some PDFs that I’ve downloaded, as an already-existing situation; and (c) most of my PDFs are large enough that DT Pro searches will find them anyway. For the moment I’m treating this as a minor aggravation.
I’ve OCR’d several thousand pages. Last night I scanned/OCR’d a number of pages into a new DT Pro database, so I’ve got some statistics on the occurrence of run-togethers in this database (I browsed the Concordance):
- 54 run-together words (probably more; I didn’t look at strings of fewer than 6 characters, and only did quick inspections)
- 10,976 unique words
- 57,647 total words
The percentage of run-togethers is pretty small: 54 of 10,976 unique words is about 0.5%. All but a couple of them had a frequency of 1 (“andthe” had a frequency of 2).
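For anyone who wants to check the arithmetic, here is a quick sketch in Python using the three counts above. The occurrence count of 55 is my assumption (54 words plus one extra hit for “andthe”); it is only a lower bound, since a couple of words appeared more than once and short strings weren’t inspected.

```python
# Counts taken from the Concordance figures quoted above.
run_together_words = 54   # lower bound; strings under 6 characters were skipped
unique_words = 10_976
total_words = 57_647

# Assumption: every run-together occurred once except "andthe" (twice),
# so total occurrences are at least 54 + 1. The true figure may be higher.
run_together_occurrences = run_together_words + 1

share_of_unique = run_together_words / unique_words
share_of_total = run_together_occurrences / total_words

print(f"{share_of_unique:.2%} of unique words")  # about half a percent
print(f"{share_of_total:.2%} of total words")    # about a tenth of a percent
```

Measured against total words rather than unique words, the rate is lower still, because nearly every run-together appears only once.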
Did they result from OCR recognition errors, errors in saving as PDF+text in the OCR plugin to DT Pro, or errors in capture by DT Pro? I don’t know.
As a practical matter, although I wish there were no errors at all, I’m delighted by the quality of searchable text in this database. I’ve had to evaluate lots of experimental data resulting from, e.g., chemical analysis.
From that perspective, the frequency of errors in this data set would be acceptable. In real-world sampling and analysis events there are always going to be some “errors” resulting from variables: sampling errors, choice of analytical methodology, instrumentation variables, etc.
If I were to seek better accuracy I could look up (from the Concordance view) the documents that have run-togethers and correct them. For the purposes of this database I’m not going to bother. Text editing can be done in PDF, but it’s a major hassle. If important, I’ll correct the error by placing the corrected string in the Comment field.
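If browsing the Concordance by eye gets tedious, the hunt for candidates could be roughed out in a script. This is only a sketch and not a DT Pro feature: it assumes you have somehow gotten the word list into a plain Python list, and the function name `split_run_together`, the tiny vocabulary, and the sample words are all hypothetical. It flags a string when it isn’t a known word itself but splits cleanly into two known words.

```python
def split_run_together(token, vocabulary, min_part=2):
    """Return every (left, right) split of token where both halves are known words.

    Returns an empty list if the token is itself a known word.
    """
    token = token.lower()
    if token in vocabulary:
        return []
    return [
        (token[:i], token[i:])
        for i in range(min_part, len(token) - min_part + 1)
        if token[:i] in vocabulary and token[i:] in vocabulary
    ]


# Hypothetical data: a tiny vocabulary and a few word-list entries.
vocabulary = {"and", "the", "of", "search", "text", "database"}
concordance = ["andthe", "database", "searchtext", "ofthe", "xyzzy"]

# Keep only the entries that split into two known words.
candidates = {}
for word in concordance:
    splits = split_run_together(word, vocabulary)
    if splits:
        candidates[word] = splits

for word, splits in candidates.items():
    print(word, "->", splits)
```

A real vocabulary would need to be much larger, and a check like this will throw false positives on legitimate compounds, so treat its output as a list to inspect rather than a list of errors.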
Bottom line: the universe isn’t fair. Always expect that PDFs you download from others may have some text glitches, and that those you create may have them. So a DT Pro Phrase or Spelling or Concordance approach may find that document that has a hidden glitch. To find it using regular search approaches, a workaround would be to “unpack/retype” the run-together (or broken) string in the Comments field.
Fortunately, there’s usually enough redundancy of text content in documents that the occasional glitch won’t keep that document from being found.