Inference API - Bijak Cloud Docs
Chat completions, embeddings, and streaming responses on the Bijak Cloud inference API.
Overview
The inference API exposes chat completions, embeddings, and streaming responses. All endpoints are OpenAI-API-compatible, which means most existing SDKs and tools work without modification.
Chat completions
curl -X POST https://api.bijakcloud.example/v1/inference/chat \
  -H "Authorization: Bearer $BIJAK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bijak-merlion-13b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is PDPA?"}
    ],
    "temperature": 0.2,
    "max_tokens": 512
  }'
The response follows the OpenAI chat-completion shape: a choices array whose entries each carry a message and a finish_reason, plus a usage object with prompt and completion token counts.
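For readers calling the endpoint from Python rather than curl, the following is a minimal sketch using only the standard library. It builds the same request as the curl example and pulls the reply out of a parsed response. The helper names build_chat_request and extract_reply are hypothetical, and the usage field names prompt_tokens and completion_tokens are an assumption based on the OpenAI shape the page refers to, not something this page confirms.

```python
import json
import urllib.request

# Endpoint copied from the curl example; helper names are hypothetical.
CHAT_URL = "https://api.bijakcloud.example/v1/inference/chat"


def build_chat_request(api_key, messages, model="bijak-merlion-13b",
                       temperature=0.2, max_tokens=512):
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        CHAT_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract_reply(response):
    """Return (text, finish_reason, total_tokens) from a parsed response dict.

    Assumption: usage uses the OpenAI field names prompt_tokens and
    completion_tokens.
    """
    choice = response["choices"][0]
    usage = response.get("usage", {})
    total = usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
    return choice["message"]["content"], choice.get("finish_reason"), total
```

To send the request, pass the object to urllib.request.urlopen and decode the body with json.load before handing it to extract_reply.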
Streaming
Pass "stream": true to receive tokens as they arrive:
curl -X POST https://api.bijakcloud.example/v1/inference/chat \
  -H "Authorization: Bearer $BIJAK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bijak-merlion-13b",
    "messages": [{"role": "user", "content": "Summarise PDPA"}],
    "stream": true
  }'
The response is a server-sent events (SSE) stream. Each event is a data: { ... } line containing a delta with the newly generated token, and the stream ends with data: [DONE].
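A client has to reassemble the streamed pieces into the full reply. The sketch below shows one way to do that in Python, given the lines of the stream as text. It assumes each event follows the OpenAI chunk layout (choices[].delta.content), which this page implies but does not spell out; the function name collect_stream_text is hypothetical.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from an iterable of SSE text lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            # Assumption: OpenAI-style chunks, i.e. choices[].delta.content
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)
```

When reading from a live HTTP response the lines arrive as bytes, so decode each one as UTF-8 before passing it in.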
Embeddings
Generate embeddings for documents and queries:
curl -X POST https://api.bijakcloud.example/v1/inference/embeddings \
  -H "Authorization: Bearer $BIJAK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bijak-embed-1024",
    "input": ["How does PDPA apply to AI workloads?", "What is sovereign inference?"]
  }'
Each input string produces a 1024-dimensional vector. The input field accepts either a single string or an array of strings.
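Once document and query vectors are in hand, they are typically compared with a similarity measure. This page does not name one, so the sketch below uses cosine similarity as a common choice rather than a documented recommendation. It also includes a small helper that mirrors the string-or-array input rule. Both function names are hypothetical.

```python
import math


def normalize_input(value):
    """Accept a single string or a list of strings; always return a list."""
    return [value] if isinstance(value, str) else list(value)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)
```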
Models
The default models are bijak-merlion-13b for chat and bijak-embed-1024 for embeddings. Both are pre-trained on Malaysian context and ship with Bahasa Melayu tokenizers. The full model list is available from GET /v1/models.
Rate limits
Rate limits apply per workspace and per API key. The default tier allows 60 requests per minute and 100,000 tokens per minute. Every response includes the limit headers X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.
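A client can use these headers to pause before it hits the limit. The sketch below is one possible approach and rests on an assumption this page does not confirm: that X-RateLimit-Reset is a Unix timestamp in seconds. If the API instead returns seconds until reset, the subtraction should be dropped. The function name seconds_to_wait is hypothetical.

```python
def seconds_to_wait(headers, now):
    """Return how long to pause before the next request, in seconds.

    headers: mapping of response header names to string values.
    now: current time as a Unix timestamp (for example time.time()).
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left in the current window
    # Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

A plain dict lookup is case-sensitive, so if the HTTP library in use lowercases header names, adjust the keys accordingly.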
Audit logging
Every inference call produces an audit log entry that includes the model, the prompt hash (not the prompt itself), the user identity, the token count, and the latency. Logs are append-only and exportable.
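To illustrate why a hash can stand in for the prompt, the sketch below builds an entry with the fields listed above. It is not the service's implementation: the hash algorithm (SHA-256), the field names, and the latency unit are all assumptions made for the example.

```python
import hashlib


def prompt_hash(prompt):
    """Hex digest of the prompt. SHA-256 is an assumption, not documented."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def make_audit_entry(model, prompt, user, token_count, latency_ms):
    """Build an illustrative audit record; field names are hypothetical."""
    return {
        "model": model,
        "prompt_hash": prompt_hash(prompt),  # the prompt text is not stored
        "user": user,
        "token_count": token_count,
        "latency_ms": latency_ms,
    }
```

Because the digest is deterministic, two calls with the same prompt can be correlated in the log without the prompt text ever being written to it.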
Next steps
- Read the RAG API reference for grounded completions.
- Review Auth for OAuth2 and service-account patterns.
- See Concepts: Sovereignty for the compliance posture.