Issues with non-searchable to searchable PDF conversion.

releazme · April 17, 2009, 10:05am

First of all, I love the possibility of converting non-searchable pdf files into searchable ones. This is a great asset for me.

But I have one problem with this process.

The scenario is this: I use DT together with Papers. I let Papers organise my PDF files in a folder that is synchronised with my DT database. This has worked well up until now. But when I convert a file through DT, I end up with a new PDF file, stripped of all metadata, that is saved somewhere else on my hard disk.

What I would like is the possibility of automatically replacing the original non-searchable PDF with the new searchable one. I would also like the latter to retain the metadata of the original.

Is this at all possible, or should I post this as a feature request?

Thanks.

cturner · April 17, 2009, 10:40am

You’ll want to INDEX your Papers database instead of IMPORTING it.

Index generates a concordance on a collection of files “in place” on your HD instead of moving them into a DTPO database. So Papers can still find them because they’re where it put them.

INDEX is under the File menu.

Charles

releazme · April 17, 2009, 11:55am

Thanks for the clarification.

That is actually what I have done, and it is what I tried to explain, admittedly rather poorly, when I wrote that I let Papers organise the PDFs and then synchronised this folder with DT. What I meant was that my Papers folder is indexed in DT and regularly synchronised.

But the problem with converting the PDF files still remains.

So, in case I did not make myself clear in the first post: The problem with the conversion process is that it produces a copy of the original file in a folder other than my Papers folder (where the original is). In order to keep things organised the way they were, I have to replace the original with the new one manually.

This is what I would like to be able to do: Open DT; select a non-searchable PDF in the indexed Papers folder; go to the Data menu -> Convert -> to searchable PDF; let the OCR process do what it has to do; have a prompt that asks whether I want to replace the original file with the new one or create a new file somewhere (or the possibility to determine this via preferences).

Hope that makes more sense.

annard · April 17, 2009, 12:36pm

This is unlikely to happen. But nothing prevents you from scripting this.
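As a rough illustration of the file-replacement half of such a script, here is a minimal Python sketch that runs outside DT. It assumes the paths of the original and the converted file are already known; the function name `replace_with_searchable` is hypothetical, and it carries over only the filesystem timestamps, not anything that Papers keeps in its own library.

```python
import os
import shutil


def replace_with_searchable(original, converted):
    """Overwrite `original` with `converted`, keeping the original's timestamps.

    Hypothetical helper: both paths are assumed to be known to the caller.
    Returns the path of a backup copy of the non-searchable file.
    """
    stat = os.stat(original)            # remember access/modification times
    backup = original + ".bak"
    shutil.copy2(original, backup)      # safety copy of the non-searchable file
    shutil.move(converted, original)    # put the searchable file where Papers expects it
    os.utime(original, (stat.st_atime, stat.st_mtime))  # restore the old timestamps
    return backup
```

Keeping a `.bak` copy is a deliberate choice in this sketch: if the OCR result turns out to be worse than the scan, the original can still be restored by hand.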

releazme · April 17, 2009, 12:41pm

Ok, thank you. Too bad for me then.

Guess I will have to find someone with scripting abilities.

acl · April 17, 2009, 5:16pm

If I understood correctly, the metadata you’re interested in is the data Papers has, i.e., things like author, journal reference etc. Unfortunately, these are not stored in the PDF but by Papers itself, and Papers is completely unscriptable, so I don’t see how it can be done.

I guess PDFs from some sources do have some metadata but I haven’t checked (and it’s certainly not generally true).

Another point. Suppose Papers were scriptable. What would I do with the author, reference etc? Would I just dump them into the comments field? I guess one could come up with a tag system (“author: blah” etc) allowing searching, but it doesn’t seem all that robust to me.
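For illustration, a tag system of the “author: blah” kind could be sketched in Python roughly as follows. The function names `parse_tags` and `format_tags` are hypothetical, and the comment text is treated as a plain string; nothing here talks to DT or Papers.

```python
def parse_tags(comment):
    """Parse lines of the form 'key: value' into a dict; other lines are ignored."""
    tags = {}
    for line in comment.splitlines():
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            tags[key.strip().lower()] = value.strip()
    return tags


def format_tags(tags):
    """Turn a dict of tags back into 'key: value' lines for a comment field."""
    return "\n".join(f"{key}: {value}" for key, value in tags.items())
```

The fragility is easy to see: any free-text line that happens to contain a colon (“Note: read this later”) would be picked up as a tag, and a repeated key silently overwrites the earlier value.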

Or did I misunderstand what it is you want?