I ran this PDF through Devonthink’s Convert To Plain Text function (in the Data menu). This results in “plain” output containing ligatures. Ligatures are the fancy typographic characters for ff, fl, fi, and so on. The plain output also contains smart curly quotes.
Is the Convert To Plain Text function an OS X built-in? If not, is it possible to change this behavior so that the curly quotes stay in the output but the ligatures are replaced with their plain equivalents, for example ff, fl, or fi? This is what copy and paste from Acrobat Reader does.
(Preview is completely baffled by this PDF – try it for yourself!)
I’m having trouble because I wanted to use the plain output for a BibTeX bibliography, which, as you may know, is very sensitive to non-ASCII characters. I can cope with curly quotes, but ligatures are a challenge.
Damn you Mac OS, and your sophisticated Mac Roman encoding!
If you are using the PDF Plug-in, it has many bugs and is forbidden here due to problems it causes at the server end. You must confirm that you have disabled it before access can be restored. (In Netscape try Edit -> Preferences -> Navigator -> Applications, look for Portable Document Format and uncheck the plug-in box. Or delete the pdf plugin dll file from the Program Files/Netscape/Navigator/Program/plugins directory and restart browser. Or for Acroread4/Explorer5 users, go into Acroread’s File : Preferences : General : Web_Browser_Integration and make sure the little box is unchecked. After a one-week grace period, access will be automatically re-enabled.)
I’m using DEVONagent as my default browser. I’m beginning to wonder if they’ve objected to a site search by someone’s DA search set (or DT Pro site Download), and are blocking me for that reason? In any case, they clamped a one-week access block on my computer. I’ve never tried to access their site before this evening.
That’s the problem. I was using Safari and it worked; I just tried DA and it doesn’t work. So they’ve blocked DA access. That sucks; what if this becomes common? Have you contacted their administrator (via the link they provide) about this? One of us should, and then see what they have to say.
Copying and pasting from the PDF in Acrobat to, say, TextEdit does indeed result in ligature-free ASCII. When I said “Preview is completely baffled with it”, I wasn’t very detailed. What I meant is that Preview cannot copy and paste text out of the document to, say, TextEdit in any sensible way. It doesn’t seem to understand line endings and spaces. That’s not the problem (although I prefer to avoid Acrobat whenever possible).
The problem is that importing the PDF into Devonthink and then converting using the built-in menu item results in a Devonthink “plain text” document, which cannot reasonably be said to be “plain” since it contains typographic characters.
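As a possible workaround until the conversion itself can be changed, the ligatures could be expanded after the fact. Below is a minimal Python sketch, not anything Devonthink provides: it assumes the converted text uses the standard Unicode ligature code points U+FB00 to U+FB06, and the function name `expand_ligatures` is made up for illustration. It leaves curly quotes and every other character untouched.

```python
# Hypothetical helper: expand the Latin ligatures from the Unicode
# "Alphabetic Presentation Forms" block into plain letters.
# Assumption: the converted text encodes ligatures as U+FB00..U+FB06.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# str.maketrans accepts a dict that maps single characters to strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with ligatures replaced; all other characters are kept."""
    return text.translate(_TABLE)
```

Running the converted text through `expand_ligatures` before pasting it into the .bib file should leave only the curly quotes to deal with.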
The arXiv is a venerable old site, with a lot of data, and they take a very dim view of any sort of automated crawling. I have no idea what Devonagent is, other than a program to search for things, so I don’t know if its behaviour could be interpreted as such. Their Robots beware! page may be of interest.
I’m thinking that we need to notify them (as per the Robots beware page) about this new type of program, DEVONagent, and how it works: it is not a routine high-frequency search spider or crawler, but a tool that aids our research by providing sophisticated search routines, web browsing of the search results, and ultimately database storage in DEVONthink. However, I think I’m too tired to do a good job of explaining it at the moment, and Bill could probably do it better anyway. What do you think, Bill? Could you contact them on behalf of this group of DA users and let us know what they say?