With GPT-4 at least, when instructed in the system prompt with something like “use the language of the content for the language of the filename”, it handles it correctly. But I didn’t do too many tests because I like to have all my stuff in English (and Japanese)
Have you tinkered with different quantizations? I think for llama2 using the chat models is always going to give very poor results; better to use the normal text (or instruct) models. But even then I just couldn’t get it to reliably do what I wanted…
Anyway I’ll create a new thread with the scripts that I used
But just using ChatGPT over llama gave a million times better results with so little prompt tinkering that I kinda just wanna roll with that for now haha
Another case I would looove to try sometime would be automatic grouping and classification. So find something common in all the documents and then group them into logical groups, similar to the auto-group feature DT had in the past.
Was thinking multiple passes for this, like:
- Generate a short summary, or a bunch of keywords and store it as annotation
- Chain all the short summaries/keywords together
- Send to ChatGPT for analysis
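Roughly what I have in mind, sketched in Python. Everything here is a placeholder (the `summarize` stand-in, the prompt wording, the sample docs), and the actual ChatGPT call is stubbed out, so this is just the shape of the passes, not a working pipeline:

```python
# Sketch of the multi-pass grouping idea. The per-document summarization
# step is stubbed out -- in practice each call would go through the
# OpenAI API (or a local llama2), and the result would be stored as the
# file's annotation in DT.

def summarize(text: str, max_keywords: int = 5) -> str:
    """Placeholder for a per-document LLM call returning a short
    summary or keyword list to store as the annotation."""
    # Naive stand-in: take the first few distinct "long" words as keywords.
    words = [w.strip(".,;:").lower() for w in text.split() if len(w) > 4]
    seen: list[str] = []
    for w in words:
        if w not in seen:
            seen.append(w)
    return ", ".join(seen[:max_keywords])

def build_grouping_prompt(annotations: dict[str, str]) -> str:
    """Chain all per-file summaries together into one analysis prompt."""
    lines = [f"- {name}: {keywords}" for name, keywords in annotations.items()]
    return (
        "Group the following documents into logical groups based on "
        "their keywords. Reply with one group per line.\n" + "\n".join(lines)
    )

# Hypothetical sample documents, just to show the flow.
docs = {
    "invoice_march.txt": "Invoice for consulting services rendered in March",
    "invoice_april.txt": "Invoice for consulting services rendered in April",
    "trip_notes.txt": "Notes about the hiking trip through the mountains",
}
annotations = {name: summarize(text) for name, text in docs.items()}
prompt = build_grouping_prompt(annotations)
# `prompt` would then be sent to ChatGPT for the final grouping pass.
```

The point of the first pass is that only the short annotations travel to the final call, so the big text blobs never hit the context window.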
Problem is that the context window is just way too small for bigger text blobs, and especially on GPT-4 that gets very expensive. The 3.5-16k model is better but still expensive just for grouping some files, so I need to do a couple of passes to make it as compact as possible first
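One way I could keep each pass inside the window: greedily batch the chained summaries under an approximate token budget and send each batch as its own request. The ~4 characters per token figure is just a rule-of-thumb assumption, not a real tokenizer:

```python
# Split chained summaries into batches that should fit a given context
# window, using the rough heuristic of ~4 characters per token.

def batch_by_token_budget(
    summaries: list[str], max_tokens: int = 3000
) -> list[list[str]]:
    """Greedily pack summaries into batches under an approximate budget."""
    max_chars = max_tokens * 4  # rough chars-per-token estimate
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for s in summaries:
        cost = len(s) + 1  # +1 for the newline joining entries
        if current and used + cost > max_chars:
            batches.append(current)
            current, used = [], 0
        current.append(s)
        used += cost
    if current:
        batches.append(current)
    return batches

# Tiny budget just to demonstrate the splitting.
summaries = [f"doc{i}: keywords keywords keywords" for i in range(100)]
batches = batch_by_token_budget(summaries, max_tokens=50)
```

Each batch would get its own grouping call, and then one final cheap call could merge the partial groupings.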
If llama2 can be fine-tuned for document naming and grouping (which I’m sure it can), that would be perfect, no more worrying about cost