re-import pdf?

acl · October 1, 2009, 12:09am

Hi,

I have just installed Snow Leopard. Snow Leopard itself fixes one problem I have had: in Leopard, papers downloaded from arxiv.org, which are the majority of my pdfs, were not read correctly.

Specifically, spaces tended to be misplaced, and you’d get things like “Solitonsandsolitaryvorticesinpancake-shapedBose-Einsteincondensates” instead of the correct form.

This caused problems with searching through them, both using Spotlight and using DT; my solution was to OCR them, which made their size balloon up to several times the previous one and degraded the quality.

SL fixes this: in addition to the column selections, most pdfs appear to be read correctly. I can search them with Spotlight and get back useful answers!

However, in DT the index does not seem to be updated (i.e., the correct text still does not get found, while the “wrong” text, with no spaces, does). I have tried “Synchronize”, but nothing changed. I have also tried exporting and reimporting the pdf, with no change. On the other hand, “convert to plain text” gives the correct text, with spaces!
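
To illustrate the symptom, here is a minimal sketch of a token-based search index. It is only a toy model, not how DT or Spotlight actually index text; the names `build_index` and `search` are made up for this example. It shows why an index built from the run-together text only matches the run-together string, while an index built from the correctly spaced text matches the individual words.

```python
import re

def build_index(docs):
    """Map each lowercase token to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(token, set()).add(name)
    return index

def search(index, query):
    """Return the documents that contain every token of the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    return set.intersection(*[index.get(t, set()) for t in tokens])

# Hypothetical index built from the old, badly extracted text (no spaces).
stale = build_index({
    "paper.pdf": "Solitonsandsolitaryvorticesinpancake-shapedBose-Einsteincondensates",
})

# Hypothetical index built from the correctly extracted text.
fresh = build_index({
    "paper.pdf": "Solitons and solitary vortices in pancake-shaped Bose-Einstein condensates",
})

print(search(stale, "solitary vortices"))                     # no match
print(search(stale, "Solitonsandsolitaryvorticesinpancake"))  # matches
print(search(fresh, "solitary vortices"))                     # matches
```

In this model, the only way to make the normal query work is to rebuild the index from the newly extracted text, which is what I am trying to get DT to do.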

So, can someone a) tell me how to get DT to reindex everything in the correct way (assuming it uses SL itself to import text), and b) explain what is going on? (a is more important than b right now!)

thanks,
acl

cgrunenberg · October 1, 2009, 10:44am

Rebuilding the database (or exporting and reimporting) should update the index. Otherwise just send an example to cgrunenberg - at - devon-technologies.com and I’ll check this, thanks.

alanshutko · October 4, 2009, 3:45pm

Christian’s tip is spot on. I did exactly the same thing to reindex my database because of the exact same bug, though I found it with old PDFs of Dragon Magazine. After rebuilding the database it works perfectly.

brianparker · October 5, 2009, 8:07am

Aha! So I’m not the only one using DT for serious data.

acl · October 5, 2009, 11:16am

Just to keep those interested updated: I have sent a sample to support (=Christian) and the problem will be fixed in the next beta. (It was not solved by reimporting the whole database.)

Seems to be related to which method is being used to extract the text.