Inference API - Bijak Cloud Docs

Overview

The inference API exposes chat completions, embeddings, and streaming responses. All endpoints are OpenAI-API-compatible, which means most existing SDKs and tools work without modification.

Chat completions

curl -X POST https://api.bijakcloud.example/v1/inference/chat \
  -H "Authorization: Bearer $BIJAK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bijak-merlion-13b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is PDPA?"}
    ],
    "temperature": 0.2,
    "max_tokens": 512
  }'

The response follows the OpenAI chat-completion shape: a choices array with message and finish_reason, plus usage with prompt and completion token counts.
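The response fields described above can be read with a few lines of client code. The sketch below assumes the standard OpenAI field names (`choices[0].message.content`, `usage.prompt_tokens`, `usage.completion_tokens`); the helper name `summarize_completion` and the sample payload are illustrative, not part of the API.

```python
from typing import Any, Dict


def summarize_completion(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the assistant text, finish reason, and token counts out of a response."""
    first = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": first["message"]["content"],
        "finish_reason": first["finish_reason"],
        # Field names assumed from the OpenAI chat-completion shape.
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Placeholder payload for illustration only; not captured from a real call.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "(assistant reply)"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 21, "completion_tokens": 9},
}

summary = summarize_completion(sample)
print(summary["finish_reason"], summary["prompt_tokens"], summary["completion_tokens"])
```

Checking `finish_reason` before using the text is usually worthwhile: in the OpenAI shape, a value of `length` indicates the reply was cut off by `max_tokens`.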

Streaming

Pass "stream": true to receive tokens as they arrive:

curl -X POST https://api.bijakcloud.example/v1/inference/chat \
  -H "Authorization: Bearer $BIJAK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bijak-merlion-13b",
    "messages": [{"role":"user","content":"Summarise PDPA"}],
    "stream": true
  }'

The response is a server-sent event (SSE) stream. Each event arrives as a data: { ... } line, and the stream ends with data: [DONE]. Each event includes a delta with the new token.
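A client reassembles the reply by reading the data: lines in order and stopping at the [DONE] sentinel. The following is a minimal sketch of that loop. It assumes each event uses the OpenAI chunk layout (`choices[0].delta.content`); `iter_deltas` is a hypothetical helper name, and the sample lines are hand-written rather than captured output.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment carried by each streamed event."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        # Assumed OpenAI-style chunk layout: choices[0].delta.content
        delta = event["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text


# Hand-written sample stream for illustration.
sample_stream = [
    'data: {"choices": [{"delta": {"content": "PD"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "PA"}}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]

assembled = "".join(iter_deltas(sample_stream))
print(assembled)
```

In a real client the `lines` argument would be the decoded lines of the HTTP response body, read incrementally so that text can be shown as it arrives.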

Embeddings

Generate embeddings for documents and queries:

curl -X POST https://api.bijakcloud.example/v1/inference/embeddings \
  -H "Authorization: Bearer $BIJAK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bijak-embed-1024",
    "input": ["How does PDPA apply to AI workloads?", "What is sovereign inference?"]
  }'

Each input string produces a 1024-dimensional vector. The input field accepts either a single string or an array of strings.
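The sketch below shows the two accepted forms of `input` and one common way to compare a query vector against a document vector once the embeddings are returned. Cosine similarity is a general technique, not something this API prescribes, and the helper names `embedding_payload` and `cosine_similarity` are hypothetical.

```python
import math
from typing import Sequence, Union


def embedding_payload(
    inputs: Union[str, Sequence[str]], model: str = "bijak-embed-1024"
) -> dict:
    """Build the request body; a lone string is passed through unchanged."""
    body_input = inputs if isinstance(inputs, str) else list(inputs)
    return {"model": model, "input": body_input}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


single = embedding_payload("What is sovereign inference?")
batch = embedding_payload(["first document", "second document"])
print(type(single["input"]).__name__, type(batch["input"]).__name__)
```

With real responses, both arguments to `cosine_similarity` would be 1024-element vectors; the function works for any matching dimension.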

Models

The default models are bijak-merlion-13b for chat and bijak-embed-1024 for embeddings. Both are pre-trained on Malaysian context and ship with Bahasa Melayu tokenizers. The full model list is available at GET /v1/models.
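As a rough sketch, the model list request can be built with the Python standard library. This assumes that /v1/models is served from the same host as the curl examples and accepts the same Bearer token; neither is stated on this page, and the response format is not documented here, so the sketch stops at constructing the request.

```python
import urllib.request

# Assumption: same host as the inference endpoints in the curl examples.
BASE_URL = "https://api.bijakcloud.example"


def models_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the model list."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )


req = models_request("example-key")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the example runs without network access.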

Rate limits

Rate limits apply per workspace and per API key. The default tier allows 60 requests per minute and 100,000 tokens per minute. Every response includes the limit headers X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.
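A client can use these headers to pause before it runs out of quota. The sketch below assumes that X-RateLimit-Reset holds a Unix timestamp in seconds; this page does not specify the format, so verify it against a real response before relying on it. The helper name `seconds_until_retry` is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_until_retry(
    headers: Mapping[str, str], now: Optional[float] = None
) -> float:
    """Return how long to pause before the next request; 0.0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)


exhausted = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1000",
}
print(seconds_until_retry(exhausted, now=970.0))
```

The lookup here is case-sensitive because it uses a plain mapping; most HTTP client libraries expose headers through a case-insensitive mapping, which can be passed in directly.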

Audit logging

Every inference call produces an audit log entry that includes the model, the prompt hash (not the prompt itself), the user identity, the token count, and the latency. Logs are append-only and exportable.
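To illustrate what storing a prompt hash instead of the prompt means, here is a minimal sketch of such an entry. The SHA-256 algorithm and every field name below are assumptions made for the example; this page does not document the actual hash function or log schema.

```python
import hashlib
import json


def audit_entry(
    model: str, prompt: str, user_id: str, token_count: int, latency_ms: float
) -> dict:
    """Build an audit record that carries a digest of the prompt, not its text."""
    # Assumption: SHA-256 over the UTF-8 bytes of the prompt.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {
        "model": model,
        "prompt_hash": digest,  # hypothetical field name
        "user": user_id,
        "token_count": token_count,
        "latency_ms": latency_ms,
    }


entry = audit_entry("bijak-merlion-13b", "What is PDPA?", "user-123", 42, 180.5)
serialized = json.dumps(entry)
print("What is PDPA?" in serialized)
```

Because the digest is deterministic, an auditor who already holds a prompt can confirm that it matches a log entry, while the log alone does not reveal the prompt text.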

Next steps

Read the RAG API reference for grounded completions.
Review Auth for OAuth2 and service-account patterns.
See Concepts: Sovereignty for the compliance posture.