Problems in indexing

physicistjedi · January 28, 2006, 11:44pm

PDF:
I am having indexing problems with PDFs. Tiger’s PDFKit often combines several adjacent words into one word, and the built-in pdftotext often omits one or more letters from some words. My problematic PDFs are very clear (not scanned and OCRed) and come from many different sources. One particular example is:
pma.caltech.edu/Courses/ph13 … 0206.2.pdf

When I look at the word list with PDFKit, many words seem to be combined, like “energydensityandamplitudetoincreaseastheyclimb”.
pdftotext persistently misreads many words, such as “behavior”, which it indexes as “ehavior”. A search for “behavior” cannot find the document, but a search inside the document (Ctrl-F) can find the word, which is very puzzling.

I hope the problem is clear enough. Any suggestions?

RTF:
In this context I want to repeat an issue with RTF indexing that was raised before: words in link titles cannot be indexed. This is very frustrating, especially when clipping from wiki sources (like Wikipedia), where the most important words are usually links and I don’t want to lose images and links. Are there any plans to address this problem in v1.1 or v2.0?

Thanks.

Bill_DeVille · January 29, 2006, 1:39am

You are quite right. I downloaded the suggested PDF file and looked at the words list. One example is “Alaserbeambouncesbackand” concatenated from a phrase on page 26. A similar example is “albeitslowly” concatenated from a phrase on page 8. I’ve seen this behavior in other cases, and I have a suspicion that it happens more frequently with some fonts than with others.

The concatenation may have been done by Apple’s PDFKit in ‘reading’ the text. I’ve noted this to Christian as a possible bug report to Apple – I think I’ve seen it happen. But note that if the PDF document was created by scanning printed pages and running OCR on the scan, this could instead be an artifact of the OCR process. It’s not uncommon for OCR programs to erroneously concatenate text strings. If I had to guess, it would be that this PDF was the result of a scan of pages, followed by OCR conversion to PDF.

Added by edit: If you open the PDF in Preview and search using a concatenated string from the Words list, Preview’s search finds the concatenated string. But the page display shows the words properly separated.

Because Command-F can find portions of strings, you can find the term “behavior” even within a concatenated string. And because the concatenated string “lies beneath” the PDF image, you can’t tell that the underlying text wasn’t correct.
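To illustrate the difference between the two kinds of search, here is a minimal Python sketch. It is only a toy model, not how DEVONthink or Preview actually work: the function names are made up for illustration, and the "index" is just a set of whole tokens taken from the extracted text layer.

```python
import re

def build_word_index(extracted_text):
    """Toy word index: the set of lowercased alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", extracted_text.lower()))

def word_search(index, term):
    """Index-style search: matches only if the term is a whole indexed token."""
    return term.lower() in index

def substring_search(extracted_text, term):
    """Find-style search: matches the term anywhere inside the text."""
    return term.lower() in extracted_text.lower()

# Text layer as a faulty extraction might return it (word spaces lost).
extracted = "Alaserbeambouncesbackand albeitslowly"
index = build_word_index(extracted)

print(word_search(index, "laser"))           # whole-word lookup on the index
print(substring_search(extracted, "laser"))  # substring lookup on the raw text
```

Under these assumptions the whole-word lookup misses “laser” because the only indexed token is the concatenated string, while the substring lookup still finds it.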

In any case, I don’t think there’s anything DEVONtechnologies can do about such problems, as their source lies elsewhere.

Word recognition/indexing in hyperlink strings isn’t possible. Apple’s Cocoa text doesn’t “see” link text as RTF text. This can be particularly frustrating when one is working with documents containing lists of referenced materials in hypertext format. A kludge is to make a duplicate of the document and reformat it as plain text, which allows DT Pro to find the terms that were previously “hidden”. But of course the links no longer work after reformatting.

physicistjedi · January 29, 2006, 5:28am

Thanks, Bill, for bringing the issue to Christian; I hope Apple can fix this bug. I am sure this PDF was not OCRed but was created directly from LaTeX. One clear piece of evidence is that Devon’s pdftotext does not have any concatenation problem, so the PDF itself is fine. But as I said, pdftotext (which is usually as good as PDFKit) mysteriously misses letters. I think the Devon people can do something about that.

Best wishes.

cgrunenberg · January 29, 2006, 11:41am

This will definitely be improved in an upcoming release (but I can’t promise it for v1.1 at the moment). The issues related to PDFKit, however, have to be fixed by Apple. In some cases (but not in this one), using pdftotext might fix the issue.

veyret · January 30, 2006, 3:17pm

I can confirm that many of the thousands of PDF files in my database were wrongly indexed with PDFKit: the word count was zero, even though the documents displayed fine, and they were not classified. Switching to pdftotext solved the problem, but I had to fix all of the wrongly indexed documents by hand!