vLLora Changelog

0.1.20

vLLora now includes a CLI tool that brings trace inspection and debugging capabilities directly to your terminal. The CLI enables fast iteration, automation workflows, and local reproduction of LLM traces.

The CLI provides commands to:

  • Search and filter traces by status, time range, model, and operation type
  • Get run overviews with span trees and LLM call summaries
  • Inspect individual LLM call payloads and responses
  • Monitor system health with aggregated statistics

Learn more in the vLLora CLI documentation.

This release also introduces Custom Providers and Models, allowing you to register your own API endpoints and model identifiers. Connect to self-hosted inference engines (like Ollama or LocalAI), private enterprise gateways, or any OpenAI-compatible service, and reference their models using a namespaced format (provider/model-id). Configure providers and models through Settings or the Chat Model Selector.

Learn more in the Custom Providers and Models documentation.
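
For a rough sketch of what the namespaced format looks like in a request, the snippet below targets a model registered under a custom provider, reusing the vllora_llm client introduced in 0.1.14. The provider name ollama and model id llama3 are placeholders for whatever you configure in Settings.

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "ollama/llama3" follows the provider/model-id namespace; both parts are
    // placeholders for a provider and model registered through Settings.
    let request = ChatCompletionRequest {
        model: "ollama/llama3".to_string(),
        messages: vec![
            ChatCompletionMessage::new_text("user".to_string(), "Say hello!".to_string()),
        ],
        ..Default::default()
    };

    let client = VlloraLLMClient::new();
    let _response = client.completions().create(request).await?;
    Ok(())
}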

0.1.19

vLLora now includes an MCP Server that exposes trace and run inspection as tools for coding agents. Debug, fix, and monitor your AI agents directly from your terminal or IDE by connecting Claude Desktop, Cursor, or any MCP-capable client to vLLora's MCP endpoint.

The MCP server provides tools to:

  • Search and filter traces by status, time range, model, and more
  • Get run overviews with span trees and error breadcrumbs
  • Inspect individual LLM call payloads and responses
  • Monitor system health with aggregated statistics

Learn more in the MCP Server documentation.

0.1.18

vLLora now supports Custom Endpoints, allowing you to use your own API endpoint for any provider. Simply provide your endpoint URL and API key through the Provider Keys UI, and vLLora will route requests to your custom endpoint instead of the provider's default endpoint.

Custom Endpoint UI

This feature enables you to:

  • Use custom API gateways and proxies
  • Connect to self-hosted models
  • Route through OpenAI-compatible endpoints

Learn more in the Custom Endpoints documentation.

0.1.16

This release fixes bugs affecting UI stability, cost calculations, and API endpoint behavior. Key improvements include fixes for debug mode span display, visual diagram interactions, and pagination handling.

0.1.15

Introducing Debug Mode, an interactive debugging feature that lets you pause LLM requests before they're sent to the model. With debug mode enabled, you can inspect the full request payload, edit messages, parameters, and tool schemas in real time, then continue execution with your modifications—all without changing your application code.

Debug Mode in action

Debug mode is perfect for debugging agent prompts, verifying model selection, inspecting tool schemas, and tuning parameters on the fly. Simply enable the breakpoint toggle in the Traces view, and every outgoing LLM request will pause for inspection and editing.

With debug mode you can:

  • Inspect the model, messages, parameters, and tool schemas
  • Continue with the original request
  • Modify the request and send your edited version instead

Learn more in the Debug Mode documentation.

0.1.14

Introducing the vllora_llm crate, a standalone Rust library that provides a unified interface for interacting with multiple LLM providers through the vLLora AI Gateway. The crate enables seamless chat completions across OpenAI-compatible, Anthropic, Gemini, and Bedrock providers, with built-in streaming support and telemetry integration.

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a chat completion request; unspecified fields fall back to defaults.
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![
            ChatCompletionMessage::new_text("user".to_string(), "Say hello!".to_string()),
        ],
        ..Default::default()
    };

    // Send the request through the vLLora AI Gateway; `response` holds the completion.
    let client = VlloraLLMClient::new();
    let response = client.completions().create(request).await?;
    Ok(())
}

Other improvements in this release:

  • Enhanced breakpoint management with GlobalBreakpointStateEvent integration
  • Improved error handling for multiple provider scenarios
  • Asynchronous improvements to intercept functionality

0.1.13

General bug fixes and improvements.

0.1.12

Introducing Clone Request & Experiments, a powerful new feature that enables you to A/B test prompts, compare models, and iterate on LLM requests directly from your traces. Clone any finished trace into an isolated experiment where you can safely tweak parameters, switch models, or modify prompts without affecting the original request.

The Experiment feature provides two editing modes: a Visual Editor for intuitive prompt tweaking and a JSON Editor for precise parameter control. Edit system and user messages, switch models on the fly, adjust temperature and other parameters, and run experiments with side-by-side comparison of tokens, costs, and outputs—all without ever touching your original trace.

Experimenting with visual editor

Perfect for prompt engineering, model comparison, parameter tuning, and iterative debugging. Learn more in the Clone and Experiment documentation.

Other improvements in this release:

  • Enhanced breakpoint management with global intercept and resume capabilities
  • Improved tracing with better parent span ID handling and status recording
  • Stream implementation refactoring for better LLM provider support
  • Better output recording and model finish reason handling

0.1.11

  • Enhanced error handling for port conflicts during startup.
  • Stream processing improvements with better tracing instrumentation for streaming responses.

0.1.10

  • Added thought signature support for Gemini 3 Pro models.
  • General bug fixes and improvements.

0.1.9

  • General bug fixes and improvements.

0.1.8

  • General bug fixes and improvements.

0.1.7

Introducing MCP Support, enabling seamless integration with Model Context Protocol servers. Connect your AI models to external tools, APIs, databases, and services through HTTP, SSE, or WebSocket transports. vLLora automatically discovers MCP tools, executes tool calls on your behalf, and traces all interactions—making it easy to extend your models with dynamic capabilities.

MCP Configuration in Settings

Other improvements in this release:

  • Full MCP server support with HTTP, SSE, and WebSocket transports, enabling dynamic tool execution and external system integration
  • Embedding model support for Bedrock, Gemini, and OpenAI with comprehensive tracing and cost tracking
  • Enhanced routing with conditional strategies, fallbacks, and maximum depth limits for complex request flows
  • Improved cost tracking with cached input token pricing and enhanced usage monitoring across all providers
  • Response caching improvements for better performance and cost optimization
  • Thread handling with service integration and middleware support for conversation management
  • Multi-tenant OpenTelemetry tracing with tenant-aware span management
  • Claude Sonnet 4.5 model support
  • Virtual model versioning via model@version syntax for flexible model selection (see the sketch after this list)
  • Variables support in chat completions for dynamic prompt templating
  • Enhanced model metadata with service level, release date, license, and knowledge cutoff information
  • Google Vertex AI model fetching and integration
  • Improved span management with RunSpanBuffer for efficient trace processing
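
As a minimal sketch of the model@version syntax mentioned above, the snippet below builds a request pinned to a specific version of a virtual model, reusing the ChatCompletionRequest type from the vllora_llm example in the 0.1.14 entry. The virtual model name my-virtual-model and the version tag v2 are hypothetical placeholders.

use vllora_llm::types::gateway::ChatCompletionRequest;

// Pin the request to version "v2" of a virtual model. Both the model name and
// the version tag are hypothetical placeholders for your own definitions.
fn pinned_request() -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: "my-virtual-model@v2".to_string(),
        ..Default::default()
    }
}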

0.1.6

  • General bug fixes and improvements.

0.1.5

  • General bug fixes and improvements.

0.1.4

  • General bug fixes and improvements.

0.1.0

  • General bug fixes and improvements.