Zapping gremlins on exporting text

korm · April 18, 2010, 12:00pm

I have some workflow scripts that make OPMLs of various selections so that I can import things into other programs. Frequently the OPML fails because of “gremlins” (invisible characters) in the Name field. (Many items are webarchives, etc., so the gremlin originates somewhere back in the process where the clipping first got into DTPO. I know of no way to determine that DTPO document names contain these characters until we export the name in a script or otherwise; “show invisible characters” doesn’t apply to doc names.)

To fix this, I open the exported file in BBEdit, use the zap gremlins feature there, and then all is well.

Looking for suggestions on whether I can fix the problem without the BBEdit step, i.e., in the script.

I build the OPML from DTPO document fields (Name, Comment, etc.), putting each field “as string”, and the whole constructed file “as Unicode text”.

cturner · April 18, 2010, 12:24pm

Sure korm, I do this all the time. Here’s a snippet that escapes text in a passage to make it suitable for JSON parsers:

#!/usr/bin/env python3
# Escape characters that are special inside JSON strings.
# Reads text from stdin and prints the escaped text to stdout.

import re
import sys

# Map each special character to its JSON escape sequence.
ESCAPES = {
    '"': '\\"',
    '\t': '\\t',
    '\n': '\\n',
    '\f': '\\f',
    '\r': '\\r',
    '\b': '\\b',
    '/': '\\/',
    '\\': '\\\\',
}


def escape(m):
    # Look up the replacement for the single character that matched.
    return ESCAPES[m.group(0)]


# The character class covers every key in ESCAPES; \x08 is backspace.
print(re.sub(r'["\t\n\f\r\x08/\\]', escape, sys.stdin.read()))

Most of my work these days is with Python and py-appscript, which is liberating after the frustration of Applescript.

You didn’t say what the actual gremlins were, but assuming they are characters below 0x20 and perhaps ESC, I might pass a string out to a utility like Darwin “tr,” rather than going through character-by-character in Applescript. (This choice largely for speed and maybe readability)
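
For comparison, the same filtering can be sketched in Python without spawning a shell. This is only a minimal sketch, assuming the gremlins are the control characters below 0x20 plus DEL (0x7F); the name strip_control_chars is made up for illustration and is not part of any library.

```python
# Hypothetical helper: delete control characters from a string.
# Assumption: the "gremlins" are code points 0x00-0x1F and 0x7F (DEL).

# str.translate deletes every code point that maps to None.
_CONTROL_TABLE = {cp: None for cp in list(range(0x20)) + [0x7F]}


def strip_control_chars(text):
    """Return text with control characters removed."""
    return text.translate(_CONTROL_TABLE)


if __name__ == "__main__":
    print(strip_control_chars("Some\x00 clipped\x1b title\x7f"))
```

Note that this also removes tabs and newlines, which is probably what you want for a document name but not for body text.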

What else do you want to know?

Best wishes,

Charles

(EDIT: Of course another solution would be to script BBEdit to open your OPML file, zap gremlins and save…)

cturner · April 18, 2010, 12:39pm

Actually korm, if you want to send me your script, and a sample “bad output” OPML file, I’d be happy to take a look at it.

I think you can PM me for an email address…

C

korm · April 18, 2010, 2:08pm

Doh
Of course.
Thanks!

cturner · April 18, 2010, 3:35pm

Although, if the strings are short, the overhead of spawning a shell may make that much slower than just going through char-by-char.

If you’re content with just 7-bit ASCII it’s easy. You can put some type of logic like:

if ((theChar < 0x20) || (theChar > 0x7e))
   theChar = 0x20

So if it’s “less than SPACE” or “greater than TILDE,” it gets turned into a SPACE.
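
The rule just described (anything outside SPACE through TILDE becomes a SPACE) might look like this in Python. It is a rough sketch rather than a drop-in fix; to_printable_ascii is a hypothetical name chosen for the example.

```python
# Hypothetical helper implementing the 7-bit rule:
# any character below SPACE (0x20) or above TILDE (0x7E) becomes a SPACE.


def to_printable_ascii(text):
    """Replace every character outside 0x20-0x7E with a space."""
    return "".join(ch if 0x20 <= ord(ch) <= 0x7E else " " for ch in text)


if __name__ == "__main__":
    print(to_printable_ascii("Report\x07 2010~final"))
```

Because each offending character is replaced rather than deleted, the string keeps its length, but accented letters are lost as well.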

If you’re using 8-bit characters, then the logic gets more complicated.
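
One possible way to handle the more complicated case, assuming the text is already a Unicode string rather than bytes in some particular 8-bit encoding, is to filter on Unicode character categories instead of numeric ranges. The sketch below drops every character whose category starts with "C" (control, format, surrogate, private use, unassigned) and keeps everything else; strip_invisible is a hypothetical name.

```python
import unicodedata

# Hypothetical helper: remove characters in the Unicode "Other" categories
# (Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned).
# Assumption: the input is a decoded Unicode string, not raw bytes.


def strip_invisible(text):
    """Return text without control or format characters."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


if __name__ == "__main__":
    print(strip_invisible("na\u200bme\x00 caf\u00e9"))
```

This keeps accented letters and ordinary spaces while removing things like zero-width spaces and byte-order marks, which the plain ASCII test cannot tell apart from legitimate non-ASCII text.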

HTH, Charles

korm · April 18, 2010, 11:47pm

With thanks to Charles, here’s a method that uses Unix tr to remove non-printable characters from the Name of a document:


set theName to do shell script ("echo " & quoted form of (name of this_item as string) & " | /usr/bin/tr -cd '[:print:]'")