Models
TL;DR
- A model on Pipelogic lives on a component vertex. There are two integration paths: file-served (upload an artifact and bind it to a vertex's file_schema slot) and runtime-fetched (point a vertex parameter at a hub model id and let the serving service pull it at deploy time).
- Serving runtimes are services, not components. Triton, TorchServe, Ollama, vLLM, and SGLang are platform-managed services. A component declares depends_on: [<service>] in its component.yml; the platform stands the service up next to the component's container at deploy time. You never pin Triton itself as a vertex.
- Among serving services there are two shapes worth knowing: multi-model services (Triton, TorchServe, Ollama) run one instance that serves many models, picked at call time. Parametrized services (vLLM, SGLang) run one instance per model; a second model means a second instance.
- A model integration is done when representative fixtures produce expected behaviour in the live runtime. "The upload returned 200" is not the bar; semantic correctness is. Mismatch (wrong labels, wrong tokenizer, wrong image size, wrong preprocessing) is the failure mode that survives clean deploys.
- The platform separates artifact (immutable workspace file) from binding (vertex slot in a backend) from serving (managed service alongside the component). Swapping any of the three is a single graph mutation, not a system migration.
What "models on Pipelogic" actually means
There are three common ways to use a model in production: bake the weights into the component image, run a model server as a service the component depends on, or point at a hosted endpoint. Pipelogic centres on the second: the model artifact and the serving runtime are kept separate from the component code, so the component declares what it needs and the platform supplies it.
Pipelogic separates three concerns:
- The model artifact is a typed workspace file: uploaded once, addressable by file_id, reusable across many backends, and bound into a component file slot rather than baked into the image.
- The vertex binding is the place in the backend graph where the artifact lives: the component's file_schema slot, bound by ppl backend add-file, recorded in the operation log, and swappable per backend without touching either the artifact or the component.
- The serving runtime is a platform-managed service the component requires via depends_on. The platform stands it up next to the component container at deploy time, the component talks to it over the local network, and the team never operates Triton or vLLM by hand.
The discipline this design asks for is to stop thinking of models as code that runs inside the component. The component is the wrapper that exposes the typed contract to the rest of the backend; the model is the artifact the wrapper consumes; the serving runtime is the service that loads the artifact and answers inference requests. Keeping the three separate is what makes "swap this model for that one" a single graph mutation rather than a code change, a republish, and a redeploy.
Mental model — component vertex, plus a serving service alongside it
A model lives on a component vertex. The serving runtime is a service the component requires via depends_on; the platform brings it up alongside the component's container at deploy time. The component's component.yml picks the integration path:
File-served (uploaded artifact)
┌─────────────────────────────┐
│ file_schema:                │
│   model:                    │
│     file_type: model        │
│     config_key: model_name  │
│ depends_on: [triton]        │
└─────────────────────────────┘
        ▲ ppl backend add-file
        │ $BID --vertex N --key model
        │ --file <file_id>

Runtime-fetched (hub id)
┌─────────────────────────────┐
│ config_schema:              │
│   model_name:               │
│     type: String            │
│     default: "<hub-id>"     │
│ cache:                      │
│   huggingface_hub:          │
│     - ids: model_name       │
│ depends_on: [sglang]        │
└─────────────────────────────┘
        ▲ ppl backend change-parameter
        │ $BID --vertex N --name model_name
        │ --type String --value "<hub-id>"
The backend binds either side with one of two operations; the component code stays generic, and the serving service comes up automatically because the component requires it.
Which path is right for which model
Use this framing when planning a new model integration.
File-served is right when the artifact is custom-trained, exported, or otherwise owned by the team. A fine-tuned ONNX model with a specific label set, a Triton model repository with a custom config.pbtxt, a TorchServe MAR with custom pre/post handlers — all of these are file-served. The artifact is uploaded as a typed workspace file, bound into the consuming vertex's slot, and travels with the backend graph as a binding. Swapping artifacts later is the same add-file operation pointed at a different file_id.
Runtime-fetched is right when the artifact is hub-hosted and the team is willing to depend on the hub. A hub-hosted LLM that vLLM or SGLang knows how to pull on demand, an Ollama model name that the serving service resolves against its catalog, a Triton model repository hosted in a remote registry — all of these are runtime-fetched. The vertex carries a string identifier; the serving service does the pull at first request (or at deploy time, depending on the service's caching strategy).
The trade-off is reproducibility versus storage. File-served means the artifact is pinned in the workspace file store, so every run loads the same bytes. Runtime-fetched means the artifact lives in the hub, so reproducibility depends on the hub resolving a given identifier to the same weights over time. For team-owned artifacts the trade-off usually favours file-served; for stable hub models it usually favours runtime-fetched. The costs run in both directions: file-serving a model the hub already hosts duplicates it in workspace storage, and runtime-fetching a private fine-tune means depending on a hub the team does not control.
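One way to make the "same bytes" property checkable on the team's side is to record a content digest of the artifact directory before upload. The sketch below is an illustration under assumptions, not a Pipelogic feature: digest_tree is a hypothetical helper that hashes relative paths and file contents in sorted order, so two directories with identical contents yield the same SHA-256.

```python
import hashlib
from pathlib import Path


def digest_tree(root: str) -> str:
    """Return a SHA-256 digest over relative paths and file bytes under root.

    Hypothetical helper for illustration; it is not part of the ppl CLI.
    """
    base = Path(root)
    files = [p for p in base.rglob("*") if p.is_file()]
    h = hashlib.sha256()
    # Sort by relative POSIX path so the digest does not depend on
    # filesystem listing order or on the platform's path separator.
    for path in sorted(files, key=lambda p: p.relative_to(base).as_posix()):
        rel = path.relative_to(base).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length prefixes keep path and content boundaries unambiguous.
        h.update(len(rel).to_bytes(8, "big"))
        h.update(rel)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()
```

Storing the digest next to the file_id (for example in the README attached at upload) gives a later reader a way to confirm that a local copy matches what was uploaded.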
Walkthrough — file-served (Triton detector)
# 1. upload the Triton model repository (directory root, NOT pre-tarred)
ppl file upload ./model_repository \
  --type model \
  --name "warehouse-detector" \
  --readme @./README.md \
  --config ./model.config.yml
# returns $FID

# 2. bind the file to the Triton Component vertex's `model` slot
ppl backend add-file $BID --vertex 2 --key model --file $FID

# 3. deploy and run representative fixtures
ppl backend deploy --backend $BID
A Triton repository's expected layout:
model_name/
├── config.pbtxt
└── 1/
    └── model.onnx   # or model.plan / model.pt / model.savedmodel/
The CLI tars the directory at upload time, so upload the directory itself; a pre-tarred upload produces a nested archive that Triton cannot load. The optional --config attaches a YAML file that sets matching vertex parameters (output names, thresholds, image size, class labels) at bind time so the backend doesn't drift from the artifact.
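A quick local check of the layout before upload can catch the most common packaging mistakes. The following is a minimal sketch, assuming the directory being checked is a single model directory shaped like the layout above; check_model_dir and MODEL_FILES are hypothetical names introduced here, and the check is not something the ppl CLI is known to perform.

```python
from pathlib import Path

# File or directory names taken from the layout shown above.
MODEL_FILES = ("model.onnx", "model.plan", "model.pt", "model.savedmodel")


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of layout problems; an empty list means the layout looks fine."""
    root = Path(model_dir)
    if not root.is_dir():
        # A tarball or other regular file here usually means a pre-tarred upload.
        return [f"{root.name} is not a directory"]
    problems = []
    if not (root / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Version directories are named with plain integers: 1/, 2/, ...
    versions = sorted(
        (d for d in root.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),
    )
    if not versions:
        problems.append("no numeric version directory such as 1/")
    for version in versions:
        if not any((version / name).exists() for name in MODEL_FILES):
            problems.append(f"version {version.name}/ has no recognised model file")
    return problems
```

Running it as check_model_dir("./model_repository/model_name") and uploading only when the result is empty keeps packaging errors out of the deploy step.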
Walkthrough — runtime-fetched (LLM via SGLang)
# 1. point the vertex at a hub model id
ppl backend change-parameter $BID --vertex 3 \
  --name model_name --type String --value '"Qwen/Qwen2.5-1.5B-Instruct"'

# 2. for a gated model, bind a Workspace secret for the hub token
ppl backend change-parameter $BID --vertex 3 \
  --name hf_token --type String --value "<secret-id>"

# 3. deploy — the deploy blocks on the runtime's cold pull / weight load
#    before it reports ready
ppl backend deploy --backend $BID
For gated hub models, the token is a workspace secret bound by ID, not a plaintext parameter — see Workspace secrets.
Multi-model vs parametrized serving services
Use this framing when picking which serving service a component should depend on.
Serving services come in two operational shapes, and the shape determines how many instances run for how many models.
Multi-model services run one instance that serves many models. Components that depend on Triton, TorchServe, or Ollama share the same service instance and pick their model by ID at call time. The pattern is right when the team is serving many models with similar runtime characteristics — half a dozen ONNX detectors with different label sets, a fleet of small LLMs for different use cases, a set of TorchServe MARs with custom handlers — and the per-model overhead of running independent instances would dominate the actual inference cost.
Parametrized services run one instance per model. Components that depend on vLLM or SGLang bind a specific model_name at startup; serving a second model means a second instance of the service. The pattern is right when the model is large enough or the inference engine is opinionated enough that mixing models inside one instance would defeat the optimisation the engine exists for. vLLM's continuous batching, SGLang's structured-generation runtime, and similar specialised engines are written assuming "one model lives in this process"; the platform respects that.
| Service | Kind | Best for |
|---|---|---|
| Triton | multi-model | CV detection / segmentation / classification, ONNX / TensorRT / TorchScript / TF SavedModel, dynamic batching. |
| TorchServe | multi-model | PyTorch models with custom Python pre/post handlers (MAR). |
| Ollama | multi-model | Quick LLM / VLM iteration, local or self-hosted. |
| vLLM | parametrized | High-throughput LLM serving, long context. |
| SGLang | parametrized | Structured generation, VLMs, OpenAI-compatible serving. |
The practical implication for backend design: two vertices that consume different LLMs through SGLang or vLLM bring up two service instances; the same two vertices against Ollama or Triton share one. That sharing is the right default for "many small models" and the wrong default for "one big optimised model".
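The instance arithmetic can be sketched in a few lines. This is an illustration only: count_instances is a hypothetical function, the service names are lower-cased labels chosen here, and it assumes that vertices pointing a parametrized service at the same model share one instance, which the text above does not state explicitly.

```python
MULTI_MODEL = {"triton", "torchserve", "ollama"}
PARAMETRIZED = {"vllm", "sglang"}


def count_instances(vertices: list[tuple[str, str]]) -> dict[str, int]:
    """Map each service to the instance count implied by (service, model) pairs."""
    models_by_service: dict[str, set[str]] = {}
    for service, model in vertices:
        models_by_service.setdefault(service, set()).add(model)
    counts = {}
    for service, models in models_by_service.items():
        if service in MULTI_MODEL:
            counts[service] = 1            # one shared instance, model picked per call
        elif service in PARAMETRIZED:
            counts[service] = len(models)  # one instance per distinct model
        else:
            raise ValueError(f"unknown service: {service}")
    return counts
```

For a backend with two SGLang vertices on different LLMs and two Triton vertices on different detectors, the function returns two SGLang instances and one Triton instance, matching the design implication described above.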
Mismatch is the failure mode
Use this framing the first time a clean deploy ships a broken model.
A model integration that uploads cleanly, binds cleanly, and deploys cleanly can still be broken. The failures cluster into a small set of shapes that share one property: none of them produce schema errors at deploy time. The platform's type system catches the structural failures (wrong file type for the slot, wrong literal type for a parameter); the semantic failures are the team's to catch with the proof loop.
The shapes worth naming, in roughly the order of frequency:
- Wrong labels — the model's output class IDs do not match what the downstream consumer expects. Schema-valid output, semantically wrong.
- Wrong tokenizer — text-in / text-out works as JSON; every character is wrong because the tokenizer disagrees with the model's training tokenizer.
- Wrong preprocessing — image size, channel order, normalisation, dtype mismatch between what the model expects and what the upstream emits.
- Wrong task — classification weights bound to a detector component, segmentation weights to a classifier.
- Cold-start mistaken for broken endpoint — first request times out because the serving service is still loading the model; second request would have succeeded.
- Missing access for gated models — the workspace secret for the hub token was not bound to the component's access-token parameter, so the runtime's pull of the gated repository is rejected by the hub and the model never loads. The deploy fails on weight load, not on schema.
The proof loop in Prove behavior is the place these failures are caught. The platform makes the proof cheap (pinned artifacts, deterministic graphs, lease-isolated test runs); the team's job is to actually run the proof before declaring "done".
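A fixture check for these semantic failures can be very small. The sketch below assumes the team already has some callable that sends one input to the deployed backend and returns its output; that callable, the name run_fixtures, and the fixture format are all hypothetical and stand in for whatever the proof loop actually uses.

```python
from typing import Callable, Iterable


def run_fixtures(
    predict: Callable[[str], str],
    fixtures: Iterable[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Return (input, expected, actual) for every fixture whose output is wrong."""
    failures = []
    for item, expected in fixtures:
        actual = predict(item)
        if actual != expected:
            failures.append((item, expected, actual))
    return failures


# Stand-in for the live runtime: the model emits class ids, and a label map
# turns them into names. A shifted label map is schema-valid but wrong.
RAW_CLASS_IDS = {"pallet.jpg": 0, "person.jpg": 1}
WRONG_LABELS = ["person", "pallet"]


def predict_with_wrong_labels(image: str) -> str:
    return WRONG_LABELS[RAW_CLASS_IDS[image]]


failures = run_fixtures(
    predict_with_wrong_labels,
    [("pallet.jpg", "pallet"), ("person.jpg", "person")],
)
```

Here both fixtures land in failures even though every output is a valid label string, which is the kind of mismatch a clean deploy does not surface.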
Replacing and iterating
Use this framing when "the artifact needs to change" comes up.
Replacing a bound artifact is a single graph mutation. Upload a new file (ppl file upload), then re-bind the slot to the new file_id. The platform's operation log records the swap, and the swap can be reverted through the usual backend undo. No component republish and no redeploy across the rest of the graph are needed: just the one binding edit and, if the serving service requires it, a redeploy of the affected vertex.
The pattern that supports cheap iteration is to keep the artifact addressable. Every uploaded file has a stable file_id; binding any of them into the vertex slot is the same operation; the team can A/B between two artifacts by binding them into sibling vertices in the same backend and comparing outputs on the same fixture set. That is the structural answer to "is the candidate better than the incumbent" — see Model artifact integration for the comparison loop.
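The sibling-vertex comparison reduces to scoring two outputs against the same expectations. The following is a sketch under assumptions: incumbent and candidate are hypothetical callables that each query one of the two sibling vertices, and compare_candidates is a name introduced here rather than a platform API.

```python
from typing import Callable, Sequence


def compare_candidates(
    incumbent: Callable[[str], str],
    candidate: Callable[[str], str],
    fixtures: Sequence[tuple[str, str]],
) -> dict:
    """Score two predictors on the same fixtures and list where they disagree."""
    if not fixtures:
        raise ValueError("need at least one fixture")
    hits = {"incumbent": 0, "candidate": 0}
    disagreements = []
    for item, expected in fixtures:
        a, b = incumbent(item), candidate(item)
        hits["incumbent"] += a == expected
        hits["candidate"] += b == expected
        if a != b:
            # Disagreements are the inputs worth inspecting by hand.
            disagreements.append(item)
    n = len(fixtures)
    return {
        "incumbent_accuracy": hits["incumbent"] / n,
        "candidate_accuracy": hits["candidate"] / n,
        "disagreements": disagreements,
    }
```

Exact-match accuracy is only one possible metric; for detectors or generative models the comparison function would need a task-appropriate score, but the structure of running both sides on identical fixtures stays the same.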
Where this fits
Models are a first-class concern on the platform because the integration loop is the place most AI projects accumulate complexity. By separating the artifact from the binding from the serving service, the platform makes each piece swappable independently — and that swappability is what keeps the integration loop cheap as the project grows. The team writes the component once, the serving service runs without team operations, the artifact swaps as the model improves. Each piece has its own lifecycle; the platform's job is to make them compose.
Related
- Files: the File primitive and lifecycle.
- File schema: declaring the model's file_schema slot.
- File types: file_type values for model artifacts.
- File binding: upload and bind the artifact.
- Backend operations: add-file and change-parameter semantics.
- Secrets: binding gated hub tokens by ID, never as plaintext.
- Model artifact integration: the full upload + bind + iterate loop.
- Solutions: where serving components fit alongside input / transform / output components.