Usage guide
This guide covers the core concepts and patterns for using the `vllora_llm` crate effectively.
Basic usage: completions client (gateway-native)
The main entrypoint is `VlloraLLMClient`, which gives you a `CompletionsClient` for chat completions using the gateway-native request/response types.
```rust
use std::sync::Arc;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::error::LLMResult;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

#[tokio::main]
async fn main() -> LLMResult<()> {
    // In production you would pass a real ModelInstance implementation
    // that knows how to call your configured providers / router.
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));

    // Build the high-level client.
    let client = VlloraLLMClient::new_with_instance(instance);

    // Build a simple chat completion request.
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Say hello in one short sentence.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // Send the request and get a single response message.
    let response = client.completions().create(request).await?;
    let message = response.message();
    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }

    Ok(())
}
```
Key pieces:

- `VlloraLLMClient`: wraps a `ModelInstance` and exposes `.completions()`.
- `CompletionsClient::create`: sends a one-shot completion request and returns a `ChatCompletionMessageWithFinishReason`.
- Gateway types (`ChatCompletionRequest`, `ChatCompletionMessage`) abstract over provider-specific formats; see the request-building sketch below.
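
Because `ChatCompletionRequest` implements `Default` (which is what the `..Default::default()` pattern above relies on), request construction is easy to wrap in a small helper. A minimal sketch using only the gateway types shown above (`simple_request` is just an illustrative name, not part of the crate):

```rust
use vllora_llm::types::gateway::{ChatCompletionMessage, ChatCompletionRequest};

/// Build a simple system + user chat request for the given model id.
/// Illustrative helper only; any unset fields fall back to their defaults.
fn simple_request(model: &str, system: &str, user: &str) -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: model.to_string(),
        messages: vec![
            ChatCompletionMessage::new_text("system".to_string(), system.to_string()),
            ChatCompletionMessage::new_text("user".to_string(), user.to_string()),
        ],
        ..Default::default()
    }
}
```

The result can then be passed straight to `client.completions().create(...)` exactly as in the example above.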
Streaming completions
`CompletionsClient::create_stream` returns a `ResultStream` that yields streaming chunks:
```rust
use std::sync::Arc;

use futures::StreamExt; // brings `.next()` into scope for the returned stream
use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::error::LLMResult;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);

    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };

    let mut stream = client.completions().create_stream(request).await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}
```
The stream API mirrors OpenAI-style streaming but uses gateway-native `ChatCompletionChunk` types.
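
If you would rather work with the complete reply, the same loop can collect the text deltas instead of printing them. A minimal sketch that is a drop-in replacement for the streaming portion of the example above (it assumes `ResultStream` implements `futures::Stream`, as the use of `.next()` suggests):

```rust
use futures::StreamExt;

// Collect every streamed text delta into one string instead of printing it.
let mut stream = client.completions().create_stream(request).await?;
let mut full_text = String::new();
while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    for choice in chunk.choices {
        if let Some(delta) = choice.delta.content {
            full_text.push_str(&delta);
        }
    }
}
println!("Full reply: {full_text}");
```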
Supported parameters
The table below lists which `ChatCompletionRequest` (and provider-specific) parameters are honored by each provider when using `VlloraLLMClient`:

| Parameter | OpenAI / Proxy | Anthropic | Gemini | Bedrock | Notes |
|---|---|---|---|---|---|
| `model` | yes | yes | yes | yes | Taken from `ChatCompletionRequest.model` or engine config. |
| `max_tokens` | yes | yes | yes | yes | Mapped to provider-specific `max_tokens` / `max_output_tokens`. |
| `temperature` | yes | yes | yes | yes | Sampling temperature. |
| `top_p` | yes | yes | yes | yes | Nucleus sampling. |
| `n` | no | no | yes | no | For Gemini, mapped to `candidate_count`; other providers always use `n = 1`. |
| `stop` / `stop_sequences` | yes | yes | yes | yes | Converted to each provider's stop / stop-sequences field. |
| `presence_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `frequency_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `logit_bias` | yes | no | no | no | OpenAI-only token bias map. |
| `user` | yes | no | no | no | OpenAI "end-user id" field. |
| `seed` | yes | no | yes | no | Deterministic sampling where supported. |
| `response_format` (JSON schema, etc.) | yes | no | yes | no | Gemini additionally normalizes JSON schema for its API. |
| `prompt_cache_key` | yes | no | no | no | OpenAI-only prompt caching hint. |
| `provider_specific.top_k` | no | yes | no | no | Anthropic-only: maps to Claude `top_k`. |
| `provider_specific.thinking` | no | yes | no | no | Anthropic "thinking" options (e.g. budget tokens). |
| Bedrock `additional_parameters` map | no | no | no | yes | Free-form JSON, passed through to Bedrock model params. |
Additionally, for Anthropic, the first system message in the conversation is mapped into a `SystemPrompt` (either as a single text string or as multiple `TextContentBlock`s), and any `cache_control` options on those blocks are translated into Anthropic's ephemeral cache-control settings.
All other fields on `ChatCompletionRequest` (such as `stream`, `tools`, `tool_choice`, `functions`, `function_call`) are handled at the gateway layer and/or per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.
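
For example, a request that tunes sampling and output length might look like the sketch below. The parameter names come from the table above, but their exact Rust field names and `Option` wrapping on `ChatCompletionRequest` are assumptions here, so check the crate docs for the precise types:

```rust
use vllora_llm::types::gateway::{ChatCompletionMessage, ChatCompletionRequest};

// Field names and Option wrapping below are illustrative assumptions
// based on the parameter table above.
let request = ChatCompletionRequest {
    model: "gpt-4.1-mini".to_string(),
    messages: vec![ChatCompletionMessage::new_text(
        "user".to_string(),
        "Write a haiku about streaming APIs.".to_string(),
    )],
    max_tokens: Some(128),
    temperature: Some(0.7),
    top_p: Some(0.95),
    ..Default::default()
};
```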
Notes
- Real usage: In the full LangDB / Vllora gateway, concrete `ModelInstance` implementations are created by the core executor based on your `models.yaml` and routing rules; the examples above use `DummyModelInstance` only to illustrate the public API of the `CompletionsClient`.
- Error handling: All client methods return `LLMResult<T>`, which wraps rich `LLMError` variants (network, mapping, provider errors, etc.); see the sketch after this list.
- More features: The same types in `vllora_llm::types::gateway` are used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.
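
A minimal error-handling sketch, reusing the `client` and `request` from the basic example and assuming only that `LLMError` implements `Display`:

```rust
match client.completions().create(request).await {
    Ok(response) => {
        // Use the returned message exactly as in the examples above.
        let message = response.message();
        if let Some(content) = &message.content {
            if let Some(text) = content.as_string() {
                println!("Model reply: {text}");
            }
        }
    }
    Err(err) => {
        // `err` is an LLMError: a network, mapping, or provider failure.
        eprintln!("Completion failed: {err}");
    }
}
```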
Roadmap and issues
- GitHub issues / roadmap: See open LLM crate issues for planned and outstanding work.
- Planned enhancements:
  - Support for built-in MCP tool calls
  - Gemini prompt caching support
  - Full support for thinking messages
Note: The Responses API is now available! See the Responses API documentation for comprehensive guides and examples.