Configuring an OpenAI-Compatible Endpoint

I am attempting to configure an OpenAI-compatible endpoint, but I am unable to enter the model name manually.

I understand from a prior post that DT4 needs to retrieve the model list in order to determine model capabilities, and that it looks at a /v1/models endpoint.

In the case of cerebras.ai there is indeed a /v1/models endpoint, but it requires authentication with the API key:

https://api.cerebras.ai/v1/models

Alternatively there is a public models endpoint that requires no authentication:

https://api.cerebras.ai/public/v1/models
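For anyone curious, the two endpoints above differ only in whether a Bearer token is required. Here is a quick sketch of how a client might pick between them (no request is actually sent; the key value is a placeholder):

```python
import urllib.request

# The authenticated and public model-list endpoints mentioned above.
AUTH_MODELS_URL = "https://api.cerebras.ai/v1/models"
PUBLIC_MODELS_URL = "https://api.cerebras.ai/public/v1/models"

def build_models_request(api_key=None):
    """Return a GET request for the models list.

    With an API key, target the authenticated /v1/models endpoint;
    without one, fall back to the public endpoint.
    """
    if api_key:
        return urllib.request.Request(
            AUTH_MODELS_URL,
            headers={"Authorization": f"Bearer {api_key}"},
        )
    return urllib.request.Request(PUBLIC_MODELS_URL)

# "sk-example" is a placeholder, not a real key.
req = build_models_request("sk-example")
print(req.full_url)                     # https://api.cerebras.ai/v1/models
print(build_models_request().full_url)  # https://api.cerebras.ai/public/v1/models
```

To actually fetch the list you would pass the request to `urllib.request.urlopen()` and parse the JSON response.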

Is it possible to support one of these /models endpoints? Or, alternatively, to let the user configure the tools and other metadata at their own risk?


As mentioned many times, we can’t add support for every AI provider and aggregator that pops up online. People need to investigate whether there’s an OpenAI-compatible endpoint and what the correct URL is. You should be looking for chat completion endpoints. In this case, the actual endpoint appears to be https://api.cerebras.ai/public/v1/chat/completions, but we logically can’t say what kind of access that really allows.

Hi @BLUEFROG

The endpoint in the image is indeed the working endpoint - I have it working in another app:

https://api.cerebras.ai/v1

The issue, as I understand it from the prior forum post, is that DT4 looks for a /v1/models endpoint to get metadata about the model. For Cerebras, that endpoint is instead /public/v1/models.

I totally understand you cannot support every model; that is why I suggested you could simply leave it to the user to configure the tools and other metadata if DT4 does not find the models endpoint.

While you cannot support “every model,” I believe the performance advantage of Cerebras hardware is profound - at a really low cost.

Typical performance I get in DEVONthink on OpenRouter for most models is at best 50-60 tokens per second.

Typical performance I get in DEVONthink on OpenRouter for gpt-oss-120b, when OpenRouter is configured for best throughput, is 300 tokens per second.

Now that alone is frankly stunning - and it is really impressive when using Chat in DEVONthink. This is an inexpensive model which yields responses about 5-6 times faster than typical models. Impressive.

But DEVONthink does not let you specify a particular OpenRouter provider other than generically “best throughput”. When I use my own app and either specify the OpenRouter provider or directly access the Cerebras API, this is what I get:

So that is an increase from 60 tokens per second for typical DEVONthink models to over 2100 tokens per second - 35 times faster, real-world. Cost: $0.35/M input, $0.75/M output. They advertise a theoretical maximum of 3000 tokens/second - I am getting 2100 tokens per second real-world. In my custom app I am chatting with a 1,000-page PDF and getting back sophisticated answers faster than I can read them.
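For concreteness, the speedup and cost work out as follows - a rough back-of-the-envelope calculation using the figures quoted above (the 4,000-in / 1,000-out token counts are just an illustrative exchange, not measured data):

```python
# Back-of-the-envelope check of the throughput and cost figures quoted above.
baseline_tps = 60      # typical OpenRouter throughput in DEVONthink (tokens/s)
cerebras_tps = 2100    # observed real-world Cerebras throughput (tokens/s)

speedup = cerebras_tps / baseline_tps
print(f"Speedup: {speedup:.0f}x")  # Speedup: 35x

# Advertised pricing: $0.35 per million input tokens, $0.75 per million output.
input_cost_per_tok = 0.35 / 1_000_000
output_cost_per_tok = 0.75 / 1_000_000

# Hypothetical exchange: 4,000 input tokens, 1,000 output tokens.
cost = 4000 * input_cost_per_tok + 1000 * output_cost_per_tok
print(f"Cost per exchange: ${cost:.6f}")  # Cost per exchange: $0.002150
```

In other words, even a fairly large prompt costs a fraction of a cent per round trip at those rates.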

I do not think this is a niche use case - the speed/cost are likely of interest to most who use AI in DT4, and I suspect the development work for DEVONtechnologies to implement Cerebras support would be trivial.

DEVONthink expects the completions endpoint, not the API’s base URL.
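In other words, append the OpenAI-compatible chat-completions path to the base URL from earlier in the thread. A tiny sketch of the difference (the helper function here is just for illustration):

```python
# DEVONthink wants the full chat-completions endpoint, not the API base URL.
BASE_URL = "https://api.cerebras.ai/v1"

def completions_endpoint(base_url):
    """Join a base URL with the OpenAI-compatible chat-completions path."""
    return base_url.rstrip("/") + "/chat/completions"

print(completions_endpoint(BASE_URL))
# https://api.cerebras.ai/v1/chat/completions
```

So the URL to enter is the full endpoint, not https://api.cerebras.ai/v1 on its own.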

OK - simple solution :slight_smile:. [@BLUEFROG was initially correct - I should have tried that earlier. Thank you both.]

Performance is beyond stunning - at minimal cost. Definitely something I would recommend other users try.

For documents within the 131K context window, the response is instant using DT4.

For documents or document sets beyond the context window size, is there a way to still use DT4? I have used a custom app which combines Cerebras with LangChain’s Refine chain - that allows context of almost any length via iteration.
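For anyone unfamiliar with the pattern: the refine approach answers on the first chunk, then revises that answer with each subsequent chunk. A minimal sketch of the idea - the `llm()` function is a stand-in, not LangChain’s actual API:

```python
# Sketch of a "refine"-style loop for documents larger than the model's
# context window: answer on the first chunk, then refine the answer with
# each subsequent chunk.

def llm(prompt):
    # Placeholder: a real implementation would POST the prompt to the
    # chat-completions endpoint and return the model's reply.
    return f"[answer based on {len(prompt)} prompt chars]"

def refine_answer(question, chunks):
    """Iteratively refine an answer across document chunks."""
    answer = llm(f"Question: {question}\nContext: {chunks[0]}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"Refine it using this additional context: {chunk}"
        )
    return answer

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
print(refine_answer("What does the document say?", chunks))
```

Each pass only needs one chunk plus the running answer in context, which is why very long documents become tractable - at the cost of one model call per chunk.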

Only by manually switching the provider/model.

Well, a 131K context size is decent enough in many cases - definitely worth trying out. Chat performance with this model is super impressive - Cerebras hardware does indeed achieve thousands of tokens per second.