vLLM

The vLLM plugin lets formae declaratively manage LoRA adapters on a running vLLM server and discover the base models it serves. It is built for edge and sovereign inference: formae continuously reconciles what a vLLM node serves and detects out-of-band drift — the hard part of running disconnected, customer-owned inference fleets on air-gapped, pre-existing GPU hardware.

The plugin assumes vLLM is already running; it does not provision hosts or GPUs. Its target is an already-running, OpenAI-compatible vLLM endpoint, which it manages over HTTP.

Repository: formae-plugin-vllm

Installation

sudo formae plugin install vllm

This installs the plugin into the agent's plugin tree (/opt/pel/formae/plugins/vllm/). The agent picks it up on next startup; restart the agent if it's already running.

The plugin is not bundled with the base formae agent image. For cloud-deployed agents (ECS, ACI, Cloud Run, Helm/K8s), bake the plugin into a derived image instead. See formae plugin for the full command reference, version pinning, and batch installs.

Configuration

Target

Configure a target with the vLLM node's OpenAI base URL:

import "@formae/formae.pkl"
import "@vllm/core/vllm.pkl"

new formae.Target {
    label = "local-vllm"
    config = new vllm.Config {
        baseUrl = "http://localhost:8000"
    }
}
Config key Required Notes
baseUrl yes vLLM OpenAI base URL, e.g. http://<node>:8000

Credentials

An optional bearer token is read from the VLLM_API_KEY environment variable (sent as Authorization: Bearer <key>); it is intentionally not part of the forma. Leave it unset for an unauthenticated server.

vLLM server prerequisites

The server must be started with --enable-lora and the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING=True so that the /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints are accepted.

Once an adapter is loaded, vLLM exposes it as its own model id: consumers call /v1/chat/completions with "model": "<loraName>" and vLLM routes through the base model plus the adapter weights. The base model remains addressable by its own id.

Offline behavior

Edge nodes are intermittently connected, so this is first-class behavior. An unreachable node (connection refused, timeout, DNS, or TLS failure) is reported as unreachable — a recoverable error that is retried — and is never mistaken for a deleted adapter. Offline ≠ deleted.

A resource is reported as missing only on a positive, authoritative absence (the node responded and the adapter is genuinely no longer served), which lets background sync remove an out-of-band-unloaded adapter from inventory. Restoring such an adapter is a matter of re-applying the source forma — re-apply is idempotent: it loads the adapter if missing and no-ops if present.

Examples

Examples live in the plugin repository. Clone the repo and resolve Pkl dependencies before running:

git clone https://github.com/platform-engineering-labs/formae-plugin-vllm.git
cd formae-plugin-vllm
pkl project resolve examples/local

Available examples:

Example Description
local Run vLLM locally on a GPU via docker-compose and manage an adapter on it
kubernetes vLLM provisioned by Kubernetes (Deployment + PVC + Service); formae manages both the workload and the adapters loaded on top
aws Bring up a GPU box with the formae AWS plugin, then manage the adapter on it (billable; apply manually)
# Evaluate an example
formae eval examples/local/forma.pkl

# Apply resources
formae apply --mode reconcile --watch examples/local/forma.pkl

Supported Resources

Type Description Native ID Discoverable Extractable
VLLM::Inference::LoRAAdapter Dynamically-loaded LoRA adapter on a running vLLM server (full CRUD) loraName Yes No
VLLM::Inference::Model Base model served by a vLLM node — observe/discover only (set at vLLM startup, not via the API) id Yes No

For VLLM::Inference::LoRAAdapter, loraName and baseModelName are create-only — changing either triggers a replacement — while loraPath is updated in place (reload).

What's next

Release notes

See release notes for changes per version.