Using Azure/Google API in AppleScript to translate documents

jsmith · May 25, 2024, 4:15pm

Hi there,

I’m hoping to use a script to call the Microsoft Azure Translate API to translate text within documents. My priority at the moment is to translate news feeds, but I would like to be able to translate any plain text, Markdown, HTML, etc. documents if possible.

Ideally the script would translate the filename (i.e., the headline for RSS news feeds) and insert an English translation above the original text, with a line dividing the two, so that I can easily verify the translation for salient documents (given continuing quirks with online translation services, particularly for more obscure languages).

Thank you in advance for any suggestions / support!

chrillek · May 25, 2024, 4:35pm

Please provide details on the API, a sample call would be best.

BLUEFROG · May 25, 2024, 4:39pm

Why this service specifically?
Define documents more precisely.
Do you have concerns about privacy?
What format are you using either on an individual feed or set in Preferences > RSS > Feed Format?
Do you have expectations of hyperlinks persisting in the translated text?

I hope you are seeing how something that’s quickly described often lacks sufficient information to implement quickly or even solve.

jsmith · May 25, 2024, 5:08pm

Here are basic parameters for text translation:

Example below:

set subscriptionKey to "AZURE API KEY"
set endpoint to "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=en"
set textToTranslate to "Este es un pez!"
-- Build the JSON body; inner double quotes must be escaped in AppleScript
set requestBody to "[{\"Text\": \"" & textToTranslate & "\"}]"
-- "quoted form of" lets the shell see each argument exactly as written
set curlCommand to "curl -X POST " & quoted form of endpoint & " -H " & quoted form of ("Ocp-Apim-Subscription-Key: " & subscriptionKey) & " -H 'Content-Type: application/json' -d " & quoted form of requestBody
set jsonResponse to do shell script curlCommand
-- Pull the first "text" value out of the JSON response
set translatedText to do shell script "echo " & quoted form of jsonResponse & " | grep -o '\"text\":\"[^\"]*\"' | head -n 1 | sed 's/\"text\":\"//;s/\"//g'"

display dialog "Translated Text: " & translatedText
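
Picking the translation out of the response with grep and sed is fragile, because the translated text can itself contain escaped quotes. A minimal sketch of the same step done with a JSON parser, written in Python rather than AppleScript. The function name extract_translation and the sample response are illustrative only, and the response shape (a list whose first item holds a "translations" list) is an assumption consistent with the "text" field the grep pattern looks for:

```python
import json

def extract_translation(json_response: str) -> str:
    """Return the first translated string from a Translator-style JSON response."""
    payload = json.loads(json_response)
    # Assumption: an error response is an object with an "error" key
    if isinstance(payload, dict) and "error" in payload:
        raise ValueError(payload["error"].get("message", "unknown error"))
    return payload[0]["translations"][0]["text"]

# Illustrative response, not captured from a real API call
sample = '[{"translations": [{"text": "This is a \\"fish\\"!", "to": "en"}]}]'
print(extract_translation(sample))
```

The parser handles the escaped quotes that the grep pattern would cut off at.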

jsmith · May 25, 2024, 5:19pm

Why this service specifically?

Only Google and Azure translate more obscure languages (e.g. Amharic). Azure is more cost-effective than Google.

Define documents more precisely.

Ideally markdown and/or HTML but I’d be happy with plain text.

Do you have concerns about privacy?

For my initial needs, no. I want to translate open source RSS feeds articles.

What format are you using either on an individual feed or set in Preferences > RSS > Feed Format?

I am currently using “automatic” and “clutter-free”, which produces an HTML Text file for my test feed. I’m then converting that to Markdown with smart rules and putting it in a group to keep images (due to RSS conversion issues mentioned here). I know “clutter-free” should clear the same clutter regardless of format, but for my test feed (Caasimada Online) other formats include various social media links.

Do you have expectations of hyperlinks persisting in the translated text?

I can certainly live without them!

I hope you are seeing how something that’s quickly described often lacks sufficient information to implement quickly or even solve.

I am, and am also as impressed as ever with the responses on this forum!

chrillek · May 26, 2024, 8:19am

Is this code working? If it is, what is the problem? If not, why – error messages?

In any case, I’d suggest dropping AppleScript for this project. You want to process text and JSON. That’s not something AS is good with. Rather use JavaScript.

And be careful with Microsoft’s documentation. At least in one case their JSON code is not correct.

After some (very basic, BTW) research, I found an example Gist illustrating how to talk to an Azure service using JXA and the Objective-C bridge:

https://gist.github.com/rcarmo/f96c659f149e357e1091cbfe352af6d4#file-shortcuts-js

automator.py

# Drop this into Automator using Python 3 as the shell

from sys import stdin
from json import dumps, loads
from urllib.parse import urlencode
from urllib.request import Request, urlopen

AZURE_ENDPOINT="getyourowndeployment.openai.azure.com"
OPENAI_API_KEY="getyourownkey"
OPENAI_API_VERSION="2023-05-15"

This file has been truncated.

shortcuts.js

function run(input, parameters) {

    // This script is a simple example of how to use the Azure OpenAI API to generate text
    // from a macOS system service written as a JavaScript for Automation (JXA) script
    // You can drop this into a Shortcuts "Run JavaScript" action (which will only work on a Mac)
    // or use it as a starting point for a more complex system service

    ObjC.import('Foundation')
    ObjC.import('Cocoa')

This file has been truncated.

test.swift

import Cocoa

let url = URL(string:"https://endpoint_name.openai.azure.com/openai/deployments/deployment_name/chat/completions?api-version=2023-05-15")!,
    prompt = "Do what you are told. No more, no less."

struct Role: Codable {
    let role: String
    let content: String
}

This file has been truncated.

I’d suggest using this as a starting point because

- it illustrates how to use your API key without putting it into the script (security!)
- it hardly uses any shell code, which makes it a lot more stable and easier to maintain than all these doShellScript thingies which require very careful quoting.
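
The first point, keeping the API key out of the script, can be sketched in Python as well. This is a minimal sketch under assumptions: it relies on the macOS security command-line tool, and the service and account names used here are placeholders for whatever you stored in your Keychain. Passing the arguments as a list means no shell quoting is involved:

```python
import subprocess

def keychain_command(service: str, account: str) -> list:
    """Build the argument list for the macOS `security` tool."""
    return ["security", "find-generic-password", "-w", "-s", service, "-a", account]

def read_keychain_secret(service: str, account: str) -> str:
    """Run `security` (macOS only) and return the stored password."""
    result = subprocess.run(keychain_command(service, account),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Placeholder names; nothing is read from the Keychain here
print(keychain_command("AzureTranslatorKey", "default"))
```

Calling read_keychain_secret with the same two names would return the key on a Mac where such an entry exists.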

jsmith · May 28, 2024, 4:03pm

Thank you again! I will let you know how it goes.

jsmith · May 29, 2024, 9:03am

So I used the Python script in the gist you mention as a framework which now does exactly what I want it to do outside of DEVONthink.

import os
import json
from urllib.request import Request, urlopen

# Set Azure Translator API details
AZURE_TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com"
TRANSLATOR_API_KEY = "getyourtranslatorapikey"
TRANSLATOR_API_VERSION = "3.0"
TRANSLATOR_REGION = "your_region"  # Replace with your Azure region

translator_url = f"{AZURE_TRANSLATOR_ENDPOINT}/translate?api-version={TRANSLATOR_API_VERSION}"
translator_headers = {
    "Ocp-Apim-Subscription-Key": TRANSLATOR_API_KEY,
    "Content-Type": "application/json",
    "Ocp-Apim-Subscription-Region": TRANSLATOR_REGION
}

def translate_text(text, from_language="so", to_language="en"):
    data = [{"Text": text}]
    request_url = f"{translator_url}&from={from_language}&to={to_language}"
    req = Request(request_url, data=json.dumps(data).encode(), method="POST", headers=translator_headers)
    with urlopen(req) as response:
        response_content = json.loads(response.read().decode())
        translated_text = response_content[0]['translations'][0]['text']
    return translated_text

def process_markdown_file(input_path, output_dir, target_language="en"):
    # Read the markdown file
    with open(input_path, 'r', encoding='utf-8') as file:
        content = file.read()
    
    # Translate the content
    translated_content = translate_text(content, to_language=target_language)
    
    # Extract and translate the file name
    file_name = os.path.basename(input_path)
    file_name_without_ext = os.path.splitext(file_name)[0]
    file_ext = os.path.splitext(file_name)[1]
    translated_file_name = translate_text(file_name_without_ext, to_language=target_language)
    output_file_name = f"{translated_file_name}{file_ext}"
    output_path = os.path.join(output_dir, output_file_name)
    
    # Write the translated content and the original content to a new markdown file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(translated_content)
        file.write("\n\n---\n\n")  # Horizontal line in Markdown
        file.write(content)

# Example usage
input_markdown_path = 'path_to_input_markdown_file.md'
output_directory = 'path_to_output_directory'
process_markdown_file(input_markdown_path, output_directory, target_language="en")

Is there a straightforward way to run this in a Smart Rule applying the script to files matching the Smart Rule criteria? Presumably I would need to adapt the input and output paths in the scripts?

Please note that I am NOT a coder. I’m a researcher trying to strengthen my ability to research.

chrillek · May 29, 2024, 10:26am

If you’d used the JavaScript code (to which I’d linked), it would be straightforward. With this Python thing, you’d have to use an AppleScript or JavaScript wrapper that runs a shell that runs Python to run the script. Feasible. But awkward.

Also please respond to the actual post, not to yourself. That keeps the thread easy to follow.

Update: I modified the original JS code so that it might be usable in a smart rule. Note that I couldn’t really test it since I don’t have access to the Azure API. In principle, it should work, but it might need some adjustments.

ObjC.import('Foundation');
ObjC.import('Cocoa');

const targetDirectory = "/users/YOU/Documents/Whatever/";

// AZURE translation API details

const AZURE_TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com"
const TRANSLATOR_API_VERSION = "3.0"
const TRANSLATOR_REGION = "your_region"  // Replace with your Azure region

const translator_url = `${AZURE_TRANSLATOR_ENDPOINT}/translate?api-version=${TRANSLATOR_API_VERSION}`;


function performsmartrule(records)  {
  /* records is an array of the records targeted by the smart rule */
  const app = Application.currentApplication();
  app.includeStandardAdditions = true;
  records.filter(r => r.type() === "markdown").forEach(r => {
    const result = getTranslatedText(app, r.plainText());
    const targetFile = `${targetDirectory}${r.name()}`; // Use the original filename to store the translation in targetDirectory
    
    // Write the result to the file
    const file = app.openForAccess(Path(targetFile), {writePermission: true});
    app.write(result, { to: file });
    app.closeAccess(file);
  })
}

function getTranslatedText(app, input) {

    // Sends input (the string to translate) to the Azure Translator API
    // and returns the translated text, or an error message if the call fails.

    const DEPLOYMENT_NAME = "my_deployment", // Set this so that you can find the API key with "security"
        // You should store your keys in the macOS Keychain and retrieve them from there:
        TRANSLATOR_API_KEY = app.doShellScript(`security find-generic-password -w -s ${AZURE_TRANSLATOR_ENDPOINT} -a ${DEPLOYMENT_NAME}`),
        translator_headers = {
            "Ocp-Apim-Subscription-Key": TRANSLATOR_API_KEY,
            "Content-Type": "application/json; charset=UTF-8",
            "Ocp-Apim-Subscription-Region": TRANSLATOR_REGION
        },
        // The Translator API expects an array of objects with a "Text" property
        postData = [{ "Text": input }],
        // The API requires a target language; "en" is used here
        request = $.NSMutableURLRequest.requestWithURL($.NSURL.URLWithString(`${translator_url}&to=en`));

    request.setHTTPMethod("POST");
    request.setHTTPBody($.NSString.alloc.initWithUTF8String(JSON.stringify(postData)).dataUsingEncoding($.NSUTF8StringEncoding));

    // Copy every header into the request
    Object.keys(translator_headers).forEach(k => {
        request.setValueForHTTPHeaderField(translator_headers[k], k);
    });

    const error = $(),
        response = $(),
        data = $.NSURLConnection.sendSynchronousRequestReturningResponseError(request, response, error);

    if (error[0]) {
        return "Error: " + error[0].localizedDescription;
    } else {
        const json = JSON.parse($.NSString.alloc.initWithDataEncoding(data, $.NSUTF8StringEncoding).js);
        if (json.error) {
            return json.error.message;
        } else {
            // A successful response is an array; the first item holds the translations
            return json[0].translations[0].text;
        }
    }
}

jsmith · June 12, 2024, 6:13am

Again, thank you!

I tried your code above but couldn’t get it to function in a SmartRule. Kept getting errors.

I have managed to get the following JavaScript to do exactly what I want it to do using Node.js:

const axios = require('axios');
const fs = require('fs');
const path = require('path');
const { execSync } = require('child_process');

function getKeychainValue(service) {
    try {
        const result = execSync(`security find-generic-password -s ${service} -w`).toString().trim();
        return result;
    } catch (error) {
        console.error(`Error retrieving keychain value for service: ${service}`);
        return null;
    }
}

const AZURE_TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com";
const TRANSLATOR_API_VERSION = "3.0";
const TRANSLATOR_REGION = getKeychainValue("AzureTranslatorRegion");  // Retrieve from Keychain
const TRANSLATOR_SUBSCRIPTION_KEY = getKeychainValue("AzureTranslatorKey");  // Retrieve from Keychain

async function translateText(text, fromLang, toLang) {
    const headers = {
        "Ocp-Apim-Subscription-Key": TRANSLATOR_SUBSCRIPTION_KEY,
        "Ocp-Apim-Subscription-Region": TRANSLATOR_REGION,
        "Content-Type": "application/json"
    };

    const postData = [{
        "Text": text
    }];

    const params = {
        "api-version": TRANSLATOR_API_VERSION,
        "from": fromLang,
        "to": toLang
    };

    try {
        const response = await axios.post(`${AZURE_TRANSLATOR_ENDPOINT}/translate`, postData, { headers: headers, params: params });
        const translations = response.data[0].translations;
        return translations[0].text;
    } catch (error) {
        console.error("Error translating text:", error.response ? error.response.data : error.message);
        return null;
    }
}

function sanitizeFilename(filename) {
    return filename.replace(/[^a-zA-Z0-9 \-]/g, ' ').trim();
}

function translateMarkdownFiles(directory) {
    fs.readdir(directory, (err, files) => {
        if (err) {
            console.error("Error reading directory:", err);
            return;
        }

        files.forEach(async (file) => {
            const filePath = path.join(directory, file);
            if (path.extname(file) === '.md') {
                const content = fs.readFileSync(filePath, 'utf-8');

                // Translate file content
                const translatedContent = await translateText(content, "so", "en");
                if (translatedContent) {
                    const newContent = `${translatedContent}\n\n---\n\n${content}`;
                    fs.writeFileSync(filePath, newContent, 'utf-8');
                    console.log(`Translated content of ${file} successfully.`);
                } else {
                    console.log(`Failed to translate content of ${file}.`);
                }

                // Translate file name
                const originalFilenameWithoutExtension = path.basename(file, '.md');
                const translatedFilename = await translateText(originalFilenameWithoutExtension, "so", "en");
                if (translatedFilename) {
                    const sanitizedFilename = sanitizeFilename(translatedFilename) + '.md';
                    const newFilePath = path.join(directory, sanitizedFilename);
                    fs.renameSync(filePath, newFilePath);
                    console.log(`Translated and renamed file from ${file} to ${sanitizedFilename} successfully.`);
                } else {
                    console.log(`Failed to translate filename of ${file}.`);
                }
            }
        });
    });
}

// Replace with the directory containing your Markdown files
const targetDirectory = "/Users/user/Test/md_files";

translateMarkdownFiles(targetDirectory);

Would you be able to help me adjust this for a DT3 Smart Rule?

chrillek · June 12, 2024, 7:49am

Very sketchily:

- modify your Node script so that it expects a single file name on the command line and returns the translation as a string on stdout
- create a smart rule that
  - acts on the MD files you want to translate
  - runs a script (AppleScript or JavaScript, though I’d prefer the latter) on these files
  - this script builds a shell command like node <yourscript.js> <filename.md>
  - it runs this command with doShellScript, assigning the result to a string variable
  - it then adds the contents of this variable to the current MD file

You get the <filename.md> from the path property of the current record. To change its content, modify its plainText property.

Though your approach of translating all files in a directory works, it is, IMO, too complicated for DT: You’d have to export all MD files into a directory, run your script on that, and then re-import all the modified files into DT. Afterward, you must delete all the old files. Too convoluted for my taste, and too error-prone.
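
Building the shell command is the step where quoting tends to go wrong, since record paths often contain spaces or apostrophes. A minimal sketch of the idea in Python (the smart rule script itself would be AppleScript or JavaScript). The helper name build_command and all three paths are hypothetical:

```python
import shlex

def build_command(node_path: str, script_path: str, record_path: str) -> str:
    """Join the three parts into one shell command, quoting each part."""
    parts = (node_path, script_path, record_path)
    return " ".join(shlex.quote(part) for part in parts)

# Hypothetical paths; the file name contains a space and an apostrophe
command = build_command("/usr/local/bin/node", "/path/to/translate.js",
                        "/Users/me/Feeds/Wararka Maanta's.md")
print(command)
```

Splitting the resulting string the way a shell would gives back the three original parts unchanged, which is the property the wrapper needs.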

jsmith · June 12, 2024, 9:52am

I’ve modified the Node script — translate.js — as such:

const fs = require('fs');
const https = require('https');
const keytar = require('keytar');
const path = require('path');

const TRANSLATOR_REGION_SERVICE = "AzureTranslatorRegion";
const TRANSLATOR_SUBSCRIPTION_KEY_SERVICE = "AzureTranslatorKey";

const baseURL = "https://api.cognitive.microsofttranslator.com";
const apiVersion = "3.0";

async function getCredentials() {
    const region = await keytar.getPassword(TRANSLATOR_REGION_SERVICE, "default");
    const subscriptionKey = await keytar.getPassword(TRANSLATOR_SUBSCRIPTION_KEY_SERVICE, "default");

    if (!region || !subscriptionKey) {
        throw new Error("Could not retrieve necessary keys from keychain");
    }

    return { region, subscriptionKey };
}

function translateText(text, fromLang, toLang, region, subscriptionKey, callback) {
    const url = `${baseURL}/translate?api-version=${apiVersion}&from=${fromLang}&to=${toLang}`;
    const postData = JSON.stringify([{ "Text": text }]);

    const options = {
        hostname: 'api.cognitive.microsofttranslator.com',
        path: `/translate?api-version=${apiVersion}&from=${fromLang}&to=${toLang}`,
        method: 'POST',
        headers: {
            'Ocp-Apim-Subscription-Key': subscriptionKey,
            'Ocp-Apim-Subscription-Region': region,
            'Content-Type': 'application/json',
            'Content-Length': Buffer.byteLength(postData)
        }
    };

    const req = https.request(options, (res) => {
        let data = '';

        res.on('data', (chunk) => {
            data += chunk;
        });

        res.on('end', () => {
            const responseJSON = JSON.parse(data);
            const translations = responseJSON[0].translations;
            callback(translations[0].text);
        });
    });

    req.on('error', (e) => {
        console.error(`Problem with request: ${e.message}`);
        callback(null);
    });

    req.write(postData);
    req.end();
}

function removeExtension(filename) {
    return filename.substring(0, filename.lastIndexOf('.')) || filename;
}

async function main() {
    const args = process.argv;

    if (args.length < 3) {
        console.log("Usage: node translate.js <file_name>");
        return;
    }

    // Join all arguments that are part of the file name
    const fileName = args.slice(2).join(" ");

    if (!fs.existsSync(fileName)) {
        console.log("File does not exist: " + fileName);
        return;
    }

    try {
        const { region, subscriptionKey } = await getCredentials();

        fs.readFile(fileName, 'utf8', (err, data) => {
            if (err) {
                console.log("Failed to read file: " + fileName);
                return;
            }

            translateText(data, "so", "en", region, subscriptionKey, function(translatedContent) {
                if (translatedContent) {
                    const combinedContent = `${translatedContent}\n\n---\n\n${data}`;
                    // Write the combined content back to the file
                    fs.writeFile(fileName, combinedContent, 'utf8', (writeErr) => {
                        if (writeErr) {
                            console.log("Failed to write translated content to file: " + fileName);
                            return;
                        }
                        console.log("File content translated and updated.");

                        // Translate the file name
                        const originalFileName = path.basename(fileName);
                        const newFileNameBase = removeExtension(originalFileName);
                        translateText(newFileNameBase, "so", "en", region, subscriptionKey, function(translatedFileName) {
                            if (translatedFileName) {
                                const newFileName = path.join(path.dirname(fileName), translatedFileName + path.extname(fileName));
                                fs.rename(fileName, newFileName, (renameErr) => {
                                    if (renameErr) {
                                        console.log("Failed to rename file: " + fileName);
                                        return;
                                    }
                                    console.log("File renamed to: " + newFileName);
                                });
                            } else {
                                console.log("Failed to translate file name.");
                            }
                        });
                    });
                } else {
                    console.log("Translation of file content failed.");
                }
            });
        });
    } catch (error) {
        console.error("Error retrieving credentials:", error.message);
    }
}

main();

I’ve checked this works as intended.

I’m still struggling to execute the script in a Smart Rule.

I’ve tried:

(() => {
    const app = Application('DEVONthink 3');
    const selectedRecords = app.selectedRecords();

    selectedRecords.forEach(record => {
        const path = record.path(); // Get the path of the selected file
        if (path) {
            const command = `path/to/node /path/to/translate.js "${path}"`;
            const doShellScript = Application('System Events').doShellScript;
            try {
                const result = doShellScript(command);
                console.log('Result:', result);
            } catch (error) {
                console.error('Error:', error);
            }
        }
    });
})();

But no joy. Can you help me get over this final hurdle?

chrillek · June 12, 2024, 10:26am

Having a look at the default code that is shown when you add an execute script action might have helped. Or reading the documentation.

Anyway, the basic structure is

function performsmartrule(records) {
  const app = Application('DEVONthink 3');
  records.forEach(r => {
   ...
  });
}

I think checking for the existence of path is not needed if your smart rule selects only MD records – those must have a path.

And console.log will not print anything if you run the code from a smart rule. Use app.logMessage() instead and look at DT’s protocol window.

More: Remove all this “newFileNameBase”, “translatedFileName” etc. stuff. Your code should only take the input file and write the translation to stdout. No fiddling around with the file names or anything – those are internal to DT, and you must not modify those internal structures!

So, as I already said: Write translatedText to stdout. I don’t know enough about node, but perhaps console.log does that. In your smart rule script, use result to add to the markdown record’s plainText like so:

r.plainText = r.plainText() + `\n\n${result}`;

or however you need to combine the text and the result.
If you do not want to modify the original record, you should use createRecordWith… to create a new one, setting its plainText in that method call.
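
The "take the input file, write only the translation to stdout" contract can be sketched as follows. This is a minimal sketch in Python rather than Node, under stated assumptions: translate is a placeholder that merely upper-cases its input and must be replaced with a real API call, and the script name translate_file.py is hypothetical:

```python
import sys

def translate(text: str) -> str:
    """Placeholder: replace with a real call to your translation service."""
    return text.upper()

def main(argv) -> int:
    """Read the file named in argv[1] and print only its translation."""
    if len(argv) != 2:
        print("Usage: translate_file.py <file_name>", file=sys.stderr)
        return 1
    with open(argv[1], encoding="utf-8") as handle:
        original = handle.read()
    # Only stdout carries the result; the file and its name are left untouched
    sys.stdout.write(translate(original))
    return 0
```

To use it from the command line, call main(sys.argv) at the bottom of the file. Messages for the user go to stderr so that the smart rule script receives nothing but the translation.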

jsmith · July 14, 2025, 12:00pm

Coming back to this. Any suggestions for simplifying the process by asking ChatGPT to translate newly imported feeds in DT4? ChatGPT is also much better at translating Somali and Tigrinya than the above models.

I know I can just ask ChatGPT to translate the text but I want to first prune my feeds based on various keywords in filenames, and then translate feeds imported in Somali or Tigrinya while keeping the original text below the translated text in case I need to verify the translation.
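
The two decisions described here, matching keywords in a filename and translating only certain languages, can be sketched in Python. This is a rough sketch under assumptions: the function names and sample keywords are hypothetical, and whether a keyword match means "keep" or "discard" is left to you:

```python
def matches_keywords(title: str, keywords) -> bool:
    """True if any keyword occurs in the title, ignoring case."""
    lowered = title.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)

def needs_translation(language: str, targets=("somali", "tigrinya")) -> bool:
    """True if the detected language is one of the languages to translate."""
    return language.strip().casefold() in targets

KEYWORDS = ["election", "drought"]  # hypothetical examples

print(matches_keywords("Drought warning issued", KEYWORDS))
print(needs_translation("Somali"))
```

In a smart rule the same two checks would run per record before any translation request is sent, so that pruned items never cost an API call.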

I’ve been using https://rsstranslator.com/ to translate feeds which then arrive in DT already translated. But the options are limited and this requires Docker to be running. I’d prefer to streamline it all into DT if possible.

Many thanks as ever for any advice/suggestions.

cgrunenberg · July 14, 2025, 12:40pm

See the “get chat response for message” AppleScript command of DEVONthink 4 and also Scripts > Chat > Translate Text…

chrillek · July 14, 2025, 12:40pm

What have you got so far? And what are the results (error messages? Wrong output? nothing at all?)

jsmith · July 29, 2025, 7:08pm

So I’ve gone with the following script, which is now working exactly as I want it to (though it may be a bit convoluted?):

property pAPIKey : "..."
property pModel : "gpt-4-turbo"
property pTargetDatabase : "Feeds"

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			try
				-- Get content
				set theHTML to source of theRecord
				set theTitle to name of theRecord
				
				-- Get the creation date of the original record
				set originalDate to creation date of theRecord
				-- Format date using AppleScript's date formatting (YY instead of YYYY)
				set yearStr to (year of originalDate as string)
				set yearStr to rich texts 3 thru 4 of yearStr -- Get last 2 digits of year
				set monthStr to (month of originalDate as integer) as string
				if length of monthStr is 1 then set monthStr to "0" & monthStr
				set dayStr to day of originalDate as string
				if length of dayStr is 1 then set dayStr to "0" & dayStr
				set hourStr to hours of originalDate as string
				if length of hourStr is 1 then set hourStr to "0" & hourStr
				set minuteStr to minutes of originalDate as string
				if length of minuteStr is 1 then set minuteStr to "0" & minuteStr
				
				set dateString to yearStr & monthStr & dayStr & "_" & hourStr & minuteStr
				
				-- Get the feed title from the parent group
				set feedTitle to ""
				try
					set parentGroup to parent 1 of theRecord
					if parentGroup is not missing value and parentGroup is not trash group then
						set feedTitle to name of parentGroup
					end if
				on error
					set feedTitle to ""
				end try
				
				-- Log for debugging
				log message "Processing: " & theTitle
				log message "  Date: " & dateString
				if feedTitle is not "" then
					log message "  From feed: " & feedTitle
				end if
				
				-- Create a Python script that handles everything
				set pythonScript to "#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
import json
import urllib.request
import urllib.parse
import html
import re

api_key = '" & pAPIKey & "'
model = '" & pModel & "'

# Get the HTML content from file
with open(sys.argv[1], 'r', encoding='utf-8') as f:
    html_content = f.read()

# Extract text from HTML preserving structure
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.current_text = []
        self.in_script = False
        self.in_style = False
    
    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        elif tag == 'style':
            self.in_style = True
        elif tag in ['p', 'div', 'br', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            if self.current_text:
                self.text_parts.append(' '.join(self.current_text).strip())
                self.current_text = []
    
    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        elif tag == 'style':
            self.in_style = False
        elif tag in ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            if self.current_text:
                self.text_parts.append(' '.join(self.current_text).strip())
                self.current_text = []
    
    def handle_data(self, data):
        if not self.in_script and not self.in_style:
            cleaned = data.strip()
            if cleaned:
                self.current_text.append(cleaned)
    
    def get_text(self):
        if self.current_text:
            self.text_parts.append(' '.join(self.current_text).strip())
        return '\\n\\n'.join([p for p in self.text_parts if p])

def sanitize_filename(filename):
    '''Clean filename to avoid filesystem issues'''
    # Remove HTML entities first
    filename = html.unescape(filename)
    
    # Replace problematic characters with safe alternatives
    replacements = {
        '/': '-',
        '\\\\': '-',
        ':': '-',
        '*': '',
        '?': '',
        '\"': '',
        '<': '',
        '>': '',
        '|': '-',
        '&': 'and',
        '#': '',
        '%': '',
        '{': '(',
        '}': ')',
        '$': '',
        '!': '',
        '@': 'at',
        '+': 'plus',
        '`': '',
        '=': '-',
        '[': '(',
        ']': ')',
        ';': '-',
        '\\'': '',
        ',': '',
        '~': '-'
    }
    
    for old, new in replacements.items():
        filename = filename.replace(old, new)
    
    # Replace multiple spaces with single space
    filename = ' '.join(filename.split())
    
    # Replace periods with spaces (except for the last one if it exists)
    parts = filename.split('.')
    if len(parts) > 1:
        # Keep everything except the last part, replace dots with spaces
        filename = ' '.join(parts[:-1]) + '.' + parts[-1]
    else:
        filename = parts[0]
    
    # Remove any remaining periods that aren't followed by an extension
    filename = re.sub(r'\\.(?![a-zA-Z]{2,4}$)', ' ', filename)
    
    # Replace multiple spaces/dashes with single ones
    filename = re.sub(r'\\s+', ' ', filename)
    filename = re.sub(r'-+', '-', filename)
    
    # Remove leading/trailing spaces and dashes
    filename = filename.strip(' -')
    
    # Ensure filename isn't empty
    if not filename or filename == '.':
        filename = 'Untitled'
    
    # Limit length (leave room for date prefix and extension)
    if len(filename) > 180:
        filename = filename[:180].strip()
    
    return filename

parser = HTMLTextExtractor()
parser.feed(html_content)
text_content = parser.get_text()

# Get title from argument
title = sys.argv[2] if len(sys.argv) > 2 else 'Document'

# Prepare the API request
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {api_key}'
}

try:
    # First, detect language - be more specific about the languages we're looking for
    lang_data = {
        'model': model,
        'messages': [
            {'role': 'system', 'content': 'You are a language detection expert. Reply with ONLY one of these exact words: English, Somali, Tigrinya, Amharic, or Unknown'},
            {'role': 'user', 'content': f'What language is this text? Choose from: English, Somali, Tigrinya, Amharic. Text: \"{text_content[:500]}\"'}
        ],
        'temperature': 0.1,
        'max_tokens': 10
    }
    
    req = urllib.request.Request(
        'https://api.openai.com/v1/chat/completions',
        data=json.dumps(lang_data).encode('utf-8'),
        headers=headers
    )
    
    response = urllib.request.urlopen(req)
    result = json.loads(response.read().decode('utf-8'))
    
    detected_language = 'Unknown'
    if 'choices' in result and len(result['choices']) > 0:
        detected_language = result['choices'][0]['message']['content'].strip()
    
    # Normalize the detected language
    detected_language_lower = detected_language.lower()
    
    # Check if translation is needed (anything except English)
    needs_translation = detected_language_lower not in ['english', 'unknown']
    
    if needs_translation:
        # Translate content
        content_data = {
            'model': model,
            'messages': [
                {'role': 'system', 'content': f'You are a professional {detected_language} to English translator. Translate the text while preserving paragraph breaks.'},
                {'role': 'user', 'content': f'Translate this {detected_language} text to English, preserving all paragraph breaks: {text_content}'}
            ],
            'temperature': 0.3,
            'max_tokens': 4000
        }
        
        req = urllib.request.Request(
            'https://api.openai.com/v1/chat/completions',
            data=json.dumps(content_data).encode('utf-8'),
            headers=headers
        )
        
        response = urllib.request.urlopen(req, timeout=120)
        result = json.loads(response.read().decode('utf-8'))
        
        if 'choices' in result and len(result['choices']) > 0:
            translated_content = result['choices'][0]['message']['content']
        else:
            translated_content = 'Translation failed'
        
        # Translate title
        title_data = {
            'model': model,
            'messages': [
                {'role': 'system', 'content': f'You are a {detected_language} to English translator. Provide ONLY the English translation with no explanations, quotes, or additional text.'},
                {'role': 'user', 'content': f'Translate to English: {title}'}
            ],
            'temperature': 0.3,
            'max_tokens': 100
        }
        
        req2 = urllib.request.Request(
            'https://api.openai.com/v1/chat/completions',
            data=json.dumps(title_data).encode('utf-8'),
            headers=headers
        )
        
        response2 = urllib.request.urlopen(req2, timeout=60)
        result2 = json.loads(response2.read().decode('utf-8'))
        
        if 'choices' in result2 and len(result2['choices']) > 0:
            translated_title = result2['choices'][0]['message']['content'].strip()
            # Clean up the title
            translated_title = translated_title.strip('\"').strip('\\'').strip()
        else:
            translated_title = title
        
        # Sanitize the filename
        translated_title = sanitize_filename(translated_title)
        
        # Create bilingual markdown output - using generic ORIGINAL TEXT header
        output_markdown = f'''# {translated_title}

{translated_content}

---

## ORIGINAL TEXT

# {title}

{text_content}'''
    else:
        # Content is already in English or Unknown, just convert to markdown
        translated_title = sanitize_filename(title)
        output_markdown = f'''# {title}

{text_content}'''
    
    # Output results for AppleScript parsing only
    print('TITLE:' + translated_title)
    print('CONTENT:' + output_markdown)
    print('LANG:' + detected_language)
    print('TRANSLATED:' + ('YES' if needs_translation else 'NO'))
    
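except Exception as e:
    # For HTTP errors the explanation from the API is in the response body, not in str(e)
    detail = str(e)
    if hasattr(e, 'read'):
        try:
            detail += ' ' + e.read().decode('utf-8', 'replace')
        except Exception:
            pass
    if 'maximum context length' in detail:
        print('ERROR:Content too long. Consider processing in chunks.')
    else:
        print(f'ERROR:{detail}')
    sys.exit(1)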
"
				
				-- Save Python script
				set pythonFile to (POSIX path of (path to temporary items)) & "translate_feeds.py"
				try
					set fileRef to open for access pythonFile with write permission
					set eof fileRef to 0
					write pythonScript to fileRef as «class utf8»
					close access fileRef
				on error
					try
						close access fileRef
					end try
				end try
				
				-- Save HTML content to file
				set htmlFile to (POSIX path of (path to temporary items)) & "content.html"
				try
					set fileRef to open for access htmlFile with write permission
					set eof fileRef to 0
					write theHTML to fileRef as «class utf8»
					close access fileRef
				on error
					try
						close access fileRef
					end try
				end try
				
				-- Run Python script
				set scriptResult to do shell script "python3 " & quoted form of pythonFile & " " & quoted form of htmlFile & " " & quoted form of theTitle
				
				-- Check for errors
				if scriptResult starts with "ERROR:" then
					error (rich texts 7 thru -1 of scriptResult)
				end if
				
				-- Parse results carefully to avoid including debug info
				set AppleScript's text item delimiters to "CONTENT:"
				set resultParts to text items of scriptResult
				set translatedTitle to rich texts 7 thru -1 of (item 1 of resultParts) -- Remove "TITLE:"
				
				-- Extract just the content part, stopping before LANG:
				set contentPart to item 2 of resultParts
				if contentPart contains "LANG:" then
					set AppleScript's text item delimiters to "LANG:"
					set contentOnlyParts to text items of contentPart
					set translatedContent to item 1 of contentOnlyParts
				else
					set translatedContent to contentPart
				end if
				
				-- Additional AppleScript sanitization for title
				set AppleScript's text item delimiters to ""
				-- Remove any remaining problematic characters
				set badChars to {":", "/", "\\", "*", "?", "\"", "<", ">", "|"}
				repeat with badChar in badChars
					set AppleScript's text item delimiters to badChar
					set titleParts to text items of translatedTitle
					set AppleScript's text item delimiters to "-"
					set translatedTitle to titleParts as string
				end repeat
				
				-- Reset delimiter
				set AppleScript's text item delimiters to ""
				
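				-- Final trim (sed instead of xargs, which fails on titles containing an unmatched apostrophe)
				set translatedTitle to do shell script "printf %s " & quoted form of translatedTitle & " | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'"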
				
				-- Ensure no double periods before extension
				if translatedTitle contains ".." then
					set AppleScript's text item delimiters to ".."
					set titleParts to text items of translatedTitle
					set AppleScript's text item delimiters to "."
					set translatedTitle to titleParts as string
					set AppleScript's text item delimiters to ""
				end if
				
				-- Prepend the date to the filename
				set finalFilename to dateString & " " & translatedTitle
				
				-- Check if we have language info (for logging only)
				set detectedLang to "unknown"
				set wasTranslated to false
				if scriptResult contains "LANG:" then
					set AppleScript's text item delimiters to "LANG:"
					set langParts to text items of scriptResult
					set langInfo to item 2 of langParts
					if langInfo contains "TRANSLATED:" then
						set AppleScript's text item delimiters to "TRANSLATED:"
						set langSubParts to text items of langInfo
						set detectedLang to rich texts 1 thru -1 of item 1 of langSubParts
						if item 2 of langSubParts contains "YES" then
							set wasTranslated to true
						end if
					else
						set detectedLang to langInfo
					end if
				end if
				
				set AppleScript's text item delimiters to ""
				
				-- Create a new markdown record with date-prefixed filename
				set mdRecord to create record with {name:finalFilename, type:markdown, content:translatedContent} in parent 1 of theRecord
				
				-- Copy all metadata
				set URL of mdRecord to URL of theRecord
				
				-- Build tags list: existing tags plus feed title
				set existingTags to tags of theRecord
				if feedTitle is not "" and feedTitle is not missing value then
					if existingTags is missing value or existingTags is {} then
						set tags of mdRecord to {feedTitle}
					else
						set tags of mdRecord to existingTags & {feedTitle}
					end if
				else
					if existingTags is not missing value then
						set tags of mdRecord to existingTags
					end if
				end if
				
				set creation date of mdRecord to creation date of theRecord
				set modification date of mdRecord to current date
				
				-- Move to target database
				set moveSuccess to false
				try
					-- Get all databases
					set allDatabases to databases
					set targetDB to missing value
					
					-- Find the target database
					repeat with aDatabase in allDatabases
						if name of aDatabase is pTargetDatabase then
							set targetDB to aDatabase
							exit repeat
						end if
					end repeat
					
					if targetDB is not missing value then
						-- Move the record
						move record mdRecord to incoming group of targetDB
						set moveSuccess to true
						log message "Successfully moved to database: " & pTargetDatabase
					else
						log message "Target database '" & pTargetDatabase & "' not found."
					end if
				on error moveError
					log message "Move error: " & moveError
				end try
				
				-- Delete the original HTML record
				try
					delete record theRecord
					log message "Deleted original HTML file"
				on error deleteError
					log message "Could not delete original: " & deleteError
				end try
				
				-- Clean up temp files
				try
					do shell script "rm -f " & quoted form of pythonFile & " " & quoted form of htmlFile
				end try
				
				if wasTranslated then
					log message "Successfully translated: " & theTitle & " → " & finalFilename & " (Language: " & detectedLang & ")"
				else
					log message "Successfully converted: " & theTitle & " → " & finalFilename & " (Language: " & detectedLang & ")"
				end if
				
			on error errMsg
				log message "Error processing '" & theTitle & "': " & errMsg
			end try
		end repeat
	end tell
end performSmartRule
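
One fragile spot in the script above: the AppleScript side splits the Python output on "CONTENT:" and "LANG:", so an article whose body happens to contain either marker would be cut short. A minimal sketch of an alternative framing, assuming the single-line fields are printed first and the free-form content last (the names `frame_result` and `parse_result` are hypothetical, not part of the script above):

```python
def frame_result(title, lang, translated, content):
    """Put the three single-line fields first and the free-form content last."""
    # Collapse whitespace (including newlines) so each header stays on one line.
    header = [
        'TITLE:' + ' '.join(title.split()),
        'LANG:' + ' '.join(lang.split()),
        'TRANSLATED:' + ('YES' if translated else 'NO'),
    ]
    return '\n'.join(header) + '\nCONTENT:\n' + content


def parse_result(framed):
    """Split only on the first four line breaks; the content is never searched."""
    title_line, lang_line, translated_line, marker, content = framed.split('\n', 4)
    if marker != 'CONTENT:':
        raise ValueError('unexpected framing')
    return {
        'title': title_line[len('TITLE:'):],
        'lang': lang_line[len('LANG:'):],
        'translated': translated_line == 'TRANSLATED:YES',
        'content': content,
    }
```

With this layout the AppleScript side could read the first three paragraphs of the result as title, language and flag, and treat everything after the "CONTENT:" line as the body. Note that `do shell script` converts line feeds to carriage returns unless called `without altering line endings`, so the parsing would need to account for that.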

BLUEFROG · July 29, 2025, 8:37pm

If it makes sense to you, you can maintain or fix it, and it provides your desired results, then play on.

cgrunenberg · July 30, 2025, 6:26am

Any particular reason why you’re still using gpt-4-turbo? It’s an old, expensive and limited model, e.g. 4 times more expensive than gpt-4.1 and 20 times more expensive than gpt-4.1-mini which is sufficient for many tasks.

Anyway, here’s a simplified example for DEVONthink 4:

property pRole : "You are a professional translator"
property pPromptTitle : "Translate the following text to English. Return only the translated text but nothing else. " & return
property pPromptMarkdown : "Translate the following text to English. Return only the translated text but nothing else. Preserve all paragraphs, the Markdown formatting and all links." & return

tell application id "DNtp"
	set numSel to count selected records
	if numSel is not 0 then
		show progress indicator "Translating" steps numSel
		repeat with theRecord in selected records
			set theTitle to name of theRecord
			step progress indicator theTitle
			set theConvertedRecord to convert record theRecord to markdown
			if theConvertedRecord is not missing value then
				set theMarkdown to plain text of theConvertedRecord
				set theTranslatedMarkdown to get chat response for message (pPromptMarkdown & "<text>" & theMarkdown & "</text>") role pRole without tool calls and thinking
				update record theConvertedRecord with text (theTranslatedMarkdown & return & return & "---" & return & return) mode inserting
				set theTranslatedTitle to get chat response for message (pPromptTitle & "<text>" & theTitle & "</text>") role pRole without tool calls and thinking
				set name of theConvertedRecord to theTranslatedTitle
			end if
		end repeat
		hide progress indicator
	end if
end tell
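
Neither script splits long documents; the longer one only reports "Content too long. Consider processing in chunks." when the model's context limit is hit. A rough sketch of paragraph-based chunking, assuming blank lines separate paragraphs and that a character count is an acceptable stand-in for tokens (the function name and the `max_chars` default are arbitrary, hypothetical choices):

```python
def chunk_paragraphs(text, max_chars=6000):
    """Group paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ''
    for paragraph in text.split('\n\n'):
        candidate = paragraph if not current else current + '\n\n' + paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or nothing collected yet: keep growing this chunk.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start anew.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent as its own translation request and the results joined with blank lines. Splitting on paragraph boundaries keeps sentences intact, at the cost of the model not seeing context from neighbouring chunks.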

jsmith · July 30, 2025, 1:11pm

I have found that gpt-4-turbo is best suited for low-resource and complex languages such as Somali and Tigrinya.

Thank you all as ever!