# Providers

## vllm

vLLM — high-throughput, self-hosted inference with an OpenAI-compatible API. For production workloads on your own GPUs.
```ts
import { vllm } from '@agentskit/adapters'

const adapter = vllm({
  model: 'meta-llama/Llama-3.3-70B-Instruct',
  url: 'http://localhost:8000/v1',
})
```

### Options
| Option | Type | Default |
| --- | --- | --- |
| `model` | `string` | required |
| `url` | `string` | `http://localhost:8000/v1` |
| `fetch` | `typeof fetch` | global `fetch` |
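The `fetch` option makes it possible to intercept outgoing requests, e.g. to attach an auth token when the vLLM server sits behind a gateway. A minimal sketch, assuming the option accepts any fetch-compatible function (`withAuthHeader` is a hypothetical helper, not exported by the package):

```ts
// Hypothetical helper (not part of @agentskit/adapters): wraps a
// fetch-compatible function so every request carries a bearer token.
type FetchLike = typeof fetch

function withAuthHeader(token: string, base: FetchLike = fetch): FetchLike {
  return (input: RequestInfo | URL, init: RequestInit = {}) => {
    const headers = new Headers(init.headers)
    headers.set('Authorization', `Bearer ${token}`)
    return base(input, { ...init, headers })
  }
}
```

It would then be passed alongside the other options, e.g. `vllm({ model, url, fetch: withAuthHeader(token) })`.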
### Why vLLM
- PagedAttention + continuous batching → best-in-class throughput.
- OpenAI-compatible API, so existing OpenAI clients work unchanged; runs well in clustered deployments.
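Because vLLM speaks the OpenAI protocol, the adapter ultimately issues standard chat-completion requests against the configured `url`. A sketch of that request shape (`buildChatRequest` is an illustrative helper, not part of the adapter):

```ts
// Sketch: the request vLLM's OpenAI-compatible endpoint expects.
// buildChatRequest is illustrative, not exported by @agentskit/adapters.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

function buildChatRequest(baseUrl: string, model: string, messages: ChatMessage[]) {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, messages }),
    },
  }
}

const req = buildChatRequest(
  'http://localhost:8000/v1',
  'meta-llama/Llama-3.3-70B-Instruct',
  [{ role: 'user', content: 'Hello' }],
)
// fetch(req.url, req.init) would POST to /v1/chat/completions
```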