Converting PDF to plain text and ligatures

caek · March 24, 2005, 3:50am

I ran this PDF through Devonthink’s Convert To Plain Text function (in the Data menu). This results in “plain” output containing ligatures. Ligatures are the fancy typographic characters for ff, fl, fi, and so on. The plain output also contains smart curly quotes.

Is the Convert To Plain Text function an OS X built-in? If not, is it possible to change its behavior so that curly quotes stay in the output but ligatures are replaced with their plain equivalents (for example, ff, fl, or fi)? This is what copy and paste from Acrobat Reader does.

(Preview is completely baffled by this PDF – try it for yourself!)

I’m having trouble because I wanted to use the plain output for a BibTeX bibliography, which you may know is very sensitive to non-ASCII characters. I can cope with curly quotes, but ligatures are a challenge.
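Before handing converted text to BibTeX, it can help to see exactly which non-ASCII characters are in it. The following is a minimal Python sketch, not part of DEVONthink or BibTeX; the function name `non_ascii_report` is hypothetical and the sample string is made up for illustration.

```python
import unicodedata


def non_ascii_report(text):
    """List each distinct non-ASCII character with its code point and Unicode name."""
    found = {ch for ch in text if ord(ch) > 127}
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in sorted(found)
    ]


# Made-up sample: curly quotes around a word that contains the "ff" ligature.
sample = "\u201cdi\ufb00use\u201d"
for char, codepoint, name in non_ascii_report(sample):
    print(codepoint, name)
```

Curly quotes and ligatures both show up in the report, so it is easy to decide which ones to keep and which ones to replace.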

Damn you Mac OS, and your sophisticated Mac Roman encoding!

Bill_DeVille · March 24, 2005, 7:12am

As for me, I’m completely baffled by the Web site, which chastised me severely for having PDF plugin and has banished me from accessing the site.

caek · March 25, 2005, 12:39am

For what it’s worth, I’m using PDF Plugin and arXiv works fine for me.

Bill_DeVille · March 25, 2005, 2:14am

Here’s the message I get when I click on your link:

"Access Denied

Sadly, you do not currently appear to have permission to access arxiv.org/pdf/astro-ph/0503302

If you are using the PDF Plug-in, it has many bugs and is forbidden here due to problems it causes at the server end. You must confirm that you have disabled it before access can be restored. (In Netscape try Edit -> Preferences -> Navigator -> Applications, look for Portable Document Format and uncheck the plug-in box. Or delete the pdf plugin dll file from the Program Files/Netscape/Navigator/Program/plugins directory and restart browser. Or for Acroread4/Explorer5 users, go into Acroread’s File : Preferences : General : Web_Browser_Integration and make sure the little box is unchecked. After a one-week grace period, access will be automatically re-enabled.)

If you believe this determination to be in error, see arxiv.org/denied.html for additional information."

Actually, I don’t have PDF Plugin. I have PDF Browser Plugin, instead – and I’ve never heard of it causing server problems. The directions for plugin removal are for a Windows computer.

Out of curiosity, I followed up the message’s link to a source of additional information. But the message I got there implied that the site thinks I’m a bot or crawler.

Oh, well. I’ve been called worse.

caek · March 25, 2005, 3:04am

I use PDF Browser Plugin too, and I’ve never had any trouble using it on the arXiv. Oh well. Let’s chalk it up to experience.

I’ve created a much shorter sample document that exhibits the problem I describe. It contains the ligatures for ff and fi.

ChemBob · March 25, 2005, 3:06am

Bill, I’m using PDF Browser Plugin too and I don’t have any problem accessing that pdf. It opens just fine. I wonder what is going on?

ChemBob

Bill_DeVille · March 25, 2005, 3:20am

ChemBob:

I’m using DEVONagent as my default browser. I’m beginning to wonder if they’ve objected to a site search by someone’s DA search set (or DT Pro site Download), and are blocking me for that reason? In any case, they clamped a one-week access block on my computer. I’ve never tried to access their site before this evening.

What browser were you using to access the site?

ChemBob · March 25, 2005, 12:51pm

That’s the problem. I was using Safari and it worked; I just tried DA and it doesn’t work. So they’ve blocked DA access. That sucks; what if this becomes common? Have you contacted their administrator (via the link they provide) about this? One of us should, and see what they have to say.

ChemBob

jadenotes · March 26, 2005, 10:07pm

I am able to view the pdf with Adobe 7 plug-in in Safari and don’t seem to get the ligatures in the plug-in or in the converted to plain text doc. Also, Preview has no trouble with it.

caek · March 27, 2005, 5:21am

Copying and pasting from the PDF in Acrobat to, say, TextEdit does indeed result in ligature-free ASCII. When I said “Preview is completely baffled by this PDF”, I wasn’t very detailed. What I meant is that Preview cannot copy and paste text out of the document to, say, TextEdit in any sensible way. It doesn’t seem to understand line endings and spaces. That’s not the problem (although I prefer to avoid Acrobat whenever possible).

The problem is that importing the PDF into Devonthink and then converting using the built-in menu item results in a Devonthink “plain text” document, which cannot reasonably be said to be “plain” since it contains typographic characters.
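As a workaround outside DEVONthink, the ligatures can be expanded after conversion while the curly quotes are left alone. This is a minimal Python sketch, assuming the converted text uses the Unicode Latin ligature code points U+FB00 to U+FB06; the names `LIGATURES` and `expand_ligatures` are hypothetical, and this is not how DEVONthink or Acrobat do it internally.

```python
# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block,
# mapped to their plain-letter equivalents.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with plain letters; leave everything else as-is."""
    return text.translate(_TABLE)


# Made-up sample: the curly quotes survive, the ligatures do not.
print(expand_ligatures("\u201cdi\ufb00use re\ufb02ection\u201d"))
```

An explicit table is used here instead of blanket Unicode compatibility normalization (NFKC), since normalization would also rewrite other characters, such as superscripts and non-breaking spaces, that one may want to keep.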

caek · March 27, 2005, 5:26am

The arXiv is a venerable old site, with a lot of data, and they take a very dim view of any sort of automated crawling. I have no idea what Devonagent is, other than a program to search for things, so I don’t know if its behaviour could be interpreted as such. Their Robots beware! page may be of interest.

ChemBob · March 27, 2005, 5:45am

I’m thinking that we need to notify them (as per the Robots beware page) about DEVONagent and how it works: it is not a routine high-frequency search spider or crawler, but a new type of program that aids our research with sophisticated search routines, web browsing of the search results, and ultimate database storage in DEVONthink. However, I think I’m too tired to do a good job of explaining it at the moment, and Bill could probably do it better anyway. What do you think, Bill? Could you contact them on behalf of this group of DA users and let us know what they say?

ChemBob