Curated catalog
A hand-picked set of popular open models, configured and ready — no hunting for weights or wiring up a serving stack.
A hand-picked set of popular open models, configured and ready — no hunting for weights or wiring up a serving stack.
From frontier-scale deepseek-r1-671b down to the lean qwen2.5-7b, pick the size that fits your task and your GPU budget.
Not in the catalog? Push your own model weights or a custom container and deploy it the same way.
Choose a model, pick a GPU, and deploy. We provision the card and boot the runtime so your endpoint comes up shortly after.
Catalog models run on vLLM with continuous batching, giving you strong throughput without tuning a serving stack.
Every deployment exposes an OpenAI-compatible API, so your existing client works by swapping the base URL and key.
Catalog or custom, the billing is the same: you rent the GPU by the hour, with no per-token fees.
General-purpose conversation and instruction following with Qwen, Llama, and Gemma.
Step-by-step problem solving with reasoning models like DeepSeek-R1.
Code generation and completion with models such as Qwen Coder.
Deploy your own weights or container alongside the catalog, on the same runtime and pricing.
Popular open models across families — Qwen (including Qwen Coder), DeepSeek (including DeepSeek-R1), Llama, Gemma and more, spanning sizes from a few billion parameters up to frontier-scale.
Yes. Push your own model weights or a custom container and deploy it the same way as a catalog model, on the same runtime and per-hour pricing.
Catalog models run on vLLM, which provides continuous batching and high throughput behind an OpenAI-compatible endpoint.
Per hour while the deployment runs. You rent the GPU; there are no per-token fees, whether you deploy from the catalog or bring your own.
Most deployments come up minutes after you click deploy — we provision the GPU and boot the runtime for you.