vLLM
The vLLM plugin lets formae declaratively manage LoRA adapters on a running vLLM server and discover the base models it serves. It is built for edge and sovereign inference: formae continuously reconciles what a vLLM node serves and detects out-of-band drift — the hard part of running disconnected, customer-owned inference fleets on air-gapped, pre-existing GPU hardware.
The plugin assumes vLLM is already running; it does not provision hosts or GPUs. Its target is an already-running, OpenAI-compatible vLLM endpoint, which it manages over HTTP.
Repository: formae-plugin-vllm
Installation
sudo formae plugin install vllm
This installs the plugin into the agent's plugin tree (/opt/pel/formae/plugins/vllm/). The agent picks it up on next startup; restart the agent if it's already running.
The plugin is not bundled with the base formae agent image. For cloud-deployed agents (ECS, ACI, Cloud Run, Helm/K8s), bake the plugin into a derived image instead. See formae plugin for the full command reference, version pinning, and batch installs.
Configuration
Target
Configure a target with the vLLM node's OpenAI base URL:
import "@formae/formae.pkl"
import "@vllm/core/vllm.pkl"
new formae.Target {
label = "local-vllm"
config = new vllm.Config {
baseUrl = "http://localhost:8000"
}
}
| Config key | Required | Notes |
|---|---|---|
baseUrl |
yes | vLLM OpenAI base URL, e.g. http://<node>:8000 |
Credentials
An optional bearer token is read from the VLLM_API_KEY environment variable (sent as Authorization: Bearer <key>); it is intentionally not part of the forma. Leave it unset for an unauthenticated server.
vLLM server prerequisites
The server must be started with --enable-lora and the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING=True so that the /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints are accepted.
Once an adapter is loaded, vLLM exposes it as its own model id: consumers call /v1/chat/completions with "model": "<loraName>" and vLLM routes through the base model plus the adapter weights. The base model remains addressable by its own id.
Offline behavior
Edge nodes are intermittently connected, so this is first-class behavior. An unreachable node (connection refused, timeout, DNS, or TLS failure) is reported as unreachable — a recoverable error that is retried — and is never mistaken for a deleted adapter. Offline ≠ deleted.
A resource is reported as missing only on a positive, authoritative absence (the node responded and the adapter is genuinely no longer served), which lets background sync remove an out-of-band-unloaded adapter from inventory. Restoring such an adapter is a matter of re-applying the source forma — re-apply is idempotent: it loads the adapter if missing and no-ops if present.
Examples
Examples live in the plugin repository. Clone the repo and resolve Pkl dependencies before running:
git clone https://github.com/platform-engineering-labs/formae-plugin-vllm.git
cd formae-plugin-vllm
pkl project resolve examples/local
Available examples:
| Example | Description |
|---|---|
| local | Run vLLM locally on a GPU via docker-compose and manage an adapter on it |
| kubernetes | vLLM provisioned by Kubernetes (Deployment + PVC + Service); formae manages both the workload and the adapters loaded on top |
| aws | Bring up a GPU box with the formae AWS plugin, then manage the adapter on it (billable; apply manually) |
# Evaluate an example
formae eval examples/local/forma.pkl
# Apply resources
formae apply --mode reconcile --watch examples/local/forma.pkl
Supported Resources
| Type | Description | Native ID | Discoverable | Extractable |
|---|---|---|---|---|
| VLLM::Inference::LoRAAdapter | Dynamically-loaded LoRA adapter on a running vLLM server (full CRUD) | loraName | Yes | No |
| VLLM::Inference::Model | Base model served by a vLLM node — observe/discover only (set at vLLM startup, not via the API) | id | Yes | No |
For VLLM::Inference::LoRAAdapter, loraName and baseModelName are create-only — changing either triggers a replacement — while loraPath is updated in place (reload).
What's next
- Learn more about resolvables in Res
- See Target resolvables for cross-plugin target configuration
Release notes
See release notes for changes per version.