Model artifact integration
TL;DR
- Model artifact integration is the workflow that connects a trained or selected model to a running backend and confirms it produces the expected outputs. It spans choosing the runtime path, getting the artifact into the backend, wiring it into the graph, running representative fixtures, benchmarking, and recording the evidence for a production-readiness decision. Testing is one step inside it.
- The runtime-path choice — file-served versus runtime-fetched — is a per-artifact decision covered in /concepts/models. This flow is the operational sequence once that choice is made.
- Cold-start is part of the measurement. First-request latency on a freshly deployed model can be an order of magnitude above steady-state latency, so a benchmark that measures only warm runtime understates the real cold path.
- Version comparison answers "is this artifact better than the incumbent": same fixture set, two artifacts in sibling bindings, same semantic criteria. The delta is the evidence the promotion decision needs.
What model integration actually is
Model integration is the path from "the model exists as a file or a hub id" to "the model produces the expected outputs on the inputs that matter, on the runtime that will serve production traffic, with measured characteristics". It spans seven concerns: choosing the runtime path, uploading the artifact with the right type, binding it into the right slot on the right vertex, wiring the surrounding pre- and post-processing components, exercising representative fixtures, measuring runtime characteristics, and capturing the evidence behind the promotion decision.
The platform keeps the artifact separate from the component for the same reason it keeps components separate from backends: the lifecycles do not align. A component release is a contract about input and output types and configuration shape. A model artifact is a snapshot of weights bound through the file-bind path, swappable without changing any of those contracts. Baking the artifact into the component image would force a component republish for every weight change and would carry the weights inside every image, when the artifact belongs in a separate, addressable file.
Each step in the loop is small. The work is in doing all of them rather than stopping at the first request that returns 200.
Mental model
artifact                         backend graph
─────────                        ──────────────
file-served path
  ppl file upload ──(file_id)──▶ ┌──────────────────┐
                                 │ vertex file slot │ ◀─┐
                                 └──────────────────┘   │
  ppl backend add-file ─────────────────────────────────┘
runtime-fetched path
  hub model id ──(string param)─▶ ┌──────────────────┐
  workspace secret (private)      │ vertex model_id  │ ◀─┐
        │                         └──────────────────┘   │
        ▼                                                │
  ppl backend change-parameter ──────────────────────────┘
        │
        ▼
  backend deploy
        │
        ▼
  live runtime + fixture run
        │
        ▼
  evidence + benchmark
The runtime-path taxonomy — what file-served and runtime-fetched mean, how each behaves at deploy time, and the fact that serving runtimes are services the component depends on rather than vertices — lives in /concepts/models. This flow shows the operations each path uses; see File upload and binding for the file-served path in detail.
Choosing the runtime path
Use this step first. The choice constrains every step that follows.
Which path fits which artifact — and the reproducibility-versus-storage trade-off behind the choice — is covered in /concepts/models. Make that decision before continuing; the rest of this flow is the operations for each path.
Uploading and binding (file-served path)
Use this for custom-trained or exported artifacts the team owns.
Upload registers the artifact as a typed workspace file, then a bind attaches it to the model slot on the vertex. The generic mechanics — --type, the description flags, the --config defaults that travel with the artifact, and ppl backend add-file — are in File binding; the Triton model-repository layout that --type model expects is in File types.
ppl file upload <artifact_path> --type <type> --name "<display>" --readme @./README.md --config ./model.config.yml
ppl backend add-file <backend_id> --vertex <vertex_id> --key <file_schema_key> --file <file_id>
For a model artifact specifically, the --config defaults are where the version's intended runtime settings travel with it: the confidence threshold the fine-tune was trained for, a label map specific to this version, the image size the preprocessing assumes. The bind is part of the backend's event-sourced operation log — undoable, replayable, and swappable by pointing the same operation at a different file_id.
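The "undoable, replayable, swappable" property of the bind can be illustrated with a minimal sketch of an append-only operation log. The `BindFile` and `OperationLog` names, the vertex name `detector`, and the file ids are hypothetical and chosen only for illustration; the platform's actual log format is not shown in this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BindFile:
    """One hypothetical log entry: attach file_id to a slot on a vertex."""
    vertex_id: str
    key: str
    file_id: str


@dataclass
class OperationLog:
    """Append-only list of operations; current state is derived by replay."""
    entries: list = field(default_factory=list)

    def apply(self, op: BindFile) -> None:
        self.entries.append(op)

    def undo(self) -> BindFile:
        # Undo drops the most recent operation; earlier ones are untouched.
        return self.entries.pop()

    def replay(self) -> dict:
        # Later binds to the same (vertex, key) slot override earlier ones.
        bindings = {}
        for op in self.entries:
            bindings[(op.vertex_id, op.key)] = op.file_id
        return bindings


log = OperationLog()
log.apply(BindFile("detector", "model", "file_v1"))
log.apply(BindFile("detector", "model", "file_v2"))  # swap: same op, new file_id
after_swap = log.replay()
log.undo()
after_undo = log.replay()
```

In this sketch, swapping a version is the same operation pointed at a different `file_id`, and undoing it restores the previous binding without touching the component.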
Pointing at a hub model (runtime-fetched path)
Use this for hub-hosted models the serving service can pull on demand.
The runtime-fetched path binds a parameter rather than a file. The component declares a model identifier parameter, the backend sets it to the hub-side id, and the serving service handles the pull.
ppl backend change-parameter <backend_id> --vertex <v> \
  --name model_name --type String --value "<hub-model-id>"
For private hub models, the access token is a workspace secret bound to a secret: true parameter the component declares for that purpose:
ppl backend change-parameter <backend_id> --vertex <v> \
  --name hub_token --type String --value "<secret-id>"
The cleartext token never enters the backend graph. Rotating the token in the workspace UI is enough; the backend picks up the new value on the next deploy. See Secrets for the bind contract.
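The separation between the secret id stored in the graph and the token value resolved at deploy time can be sketched as follows. This is an illustration under assumptions: `resolve_parameters`, the dictionary layout, and the example ids are hypothetical, not the platform's internal representation.

```python
def resolve_parameters(graph_params, secret_store):
    """Build runtime parameters at deploy time, swapping secret ids for values."""
    resolved = {}
    for name, param in graph_params.items():
        if param.get("secret"):
            # The graph holds only the id; the value comes from the store.
            resolved[name] = secret_store[param["value"]]
        else:
            resolved[name] = param["value"]
    return resolved


graph_params = {
    "model_name": {"value": "org/private-model", "secret": False},
    "hub_token": {"value": "secret-123", "secret": True},
}
secret_store = {"secret-123": "token-old"}

first_deploy = resolve_parameters(graph_params, secret_store)
secret_store["secret-123"] = "token-new"  # rotation; graph_params is untouched
second_deploy = resolve_parameters(graph_params, secret_store)
```

Because the graph stores an id rather than the token, rotation changes only the store, and the next resolution sees the new value.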
Wiring the surroundings
Use this step before treating any model as integrated.
A model rarely sits alone in a backend. It needs an input source feeding it correctly shaped data and an output sink consuming its produced shape. At the wiring step the platform's type inference catches shape mismatches at graph-edit time, before the runtime sees them.
Common wiring failures the type system catches early: the model expects an Image of a specific size and the upstream emits a raw Image without resizing (insert a resize transformation); the model emits [BoundingBox] but the downstream consumer expects Polygon<Double> (insert a converter); the model needs preprocessing that the platform exposes as a separate component (add it as a vertex). A correctly wired graph keeps these out of runtime; an uncaught mismatch surfaces later as a confusing runtime failure.
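A toy version of an edit-time edge check might look like the sketch below. It is deliberately simplified: it compares type names for exact equality, whereas the platform's type inference is presumably richer. The `Port` class, the vertex names, and the `Image[640x640]` notation are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    vertex: str
    type_name: str  # hypothetical notation, e.g. "Image", "[BoundingBox]"


def check_edges(edges):
    """Return one message per edge whose output type differs from the input type."""
    problems = []
    for src, dst in edges:
        if src.type_name != dst.type_name:
            problems.append(
                f"{src.vertex} emits {src.type_name}, "
                f"{dst.vertex} expects {dst.type_name}"
            )
    return problems


edges = [
    # Raw image fed to a model that expects a fixed size: needs a resize vertex.
    (Port("camera", "Image"), Port("detector", "Image[640x640]")),
    # Matching types: no problem reported.
    (Port("detector", "[BoundingBox]"), Port("overlay", "[BoundingBox]")),
]
problems = check_edges(edges)
```

The point of the sketch is the timing: the mismatch is reported while the graph is being edited, before any request reaches the runtime.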
Proving the integration
Use this every time. The bind being valid is not the same as the model being correct.
Proof is the loop described in Prove behavior: identify the unit under test (the artifact, not the component or the backend), freeze a fixture set, run them through the deployed backend, and compare observed outputs to semantic expectations. Model proof checks meaning, not shape: a detector that emits valid JSON with the wrong labels is broken, a transcription model that emits well-formed text in the wrong language is broken, a segmentation model that emits valid polygons in the wrong coordinate system is broken. Schema validation catches none of these.
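The difference between a shape check and a meaning check can be made concrete with a small sketch. The detection format (a `label` string and a four-number `box`) and both function names are assumptions for illustration, not the platform's output schema.

```python
def schema_ok(detections):
    """Shape check only: every detection has a string label and a 4-number box."""
    return all(
        isinstance(d.get("label"), str)
        and isinstance(d.get("box"), list)
        and len(d["box"]) == 4
        for d in detections
    )


def labels_match(detections, expected_labels):
    """Semantic check: predicted labels equal the expected labels, order ignored."""
    return sorted(d["label"] for d in detections) == sorted(expected_labels)


# A well-formed output from a detector bound to the wrong label map.
observed = [{"label": "cat", "box": [10, 20, 110, 220]}]
expected = ["dog"]

shape_passes = schema_ok(observed)
meaning_passes = labels_match(observed, expected)
```

Here the schema check passes while the semantic check fails, which is the case that schema validation alone would let through.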
One Triton-specific check belongs in the proof loop for ONNX-on-Triton artifacts: the config.pbtxt dims must match the ONNX shape element by element. Symbolic ONNX dimensions appear to the runtime as -1, so a config that pins a concrete size against a symbolic dimension fails to load. Confirm the repository's declared shapes line up with the ONNX graph before treating a load failure as a model problem.
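The element-by-element comparison can be sketched in plain Python. The helper names are hypothetical, and the shapes are passed in as lists rather than read from real files; in practice the ONNX side would come from the model's input and output shape metadata. The sketch also assumes the config lists the full shape; a config that uses Triton's implicit batch dimension omits the batch entry from dims and would need that entry dropped before comparing.

```python
def onnx_dims_as_triton(onnx_dims):
    """Map ONNX dims to the runtime's view: symbolic or unknown entries become -1."""
    return [d if isinstance(d, int) and d > 0 else -1 for d in onnx_dims]


def dim_mismatches(config_dims, onnx_dims):
    """List (index, config_value, expected_value) for each disagreeing position."""
    expected = onnx_dims_as_triton(onnx_dims)
    if len(config_dims) != len(expected):
        return [("rank", len(config_dims), len(expected))]
    return [
        (i, c, e) for i, (c, e) in enumerate(zip(config_dims, expected)) if c != e
    ]


# ONNX input with symbolic batch, height, and width.
onnx_shape = ["batch", 3, "height", "width"]

pinned = dim_mismatches([-1, 3, 640, 640], onnx_shape)   # concrete vs symbolic
aligned = dim_mismatches([-1, 3, -1, -1], onnx_shape)    # matches the graph
```

Running a check like this before deploy separates a shape-declaration problem from a problem with the weights themselves.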
Cold-start measurement is part of the proof, not separate from it. Models often load lazily on first use: a hub model pulls on first request, ONNX initialises on first inference, Triton loads the repository when the first matching request arrives. First-request latency can be 10–30 seconds above steady-state, so a benchmark that measures only warm runtime understates the cold path. Run the fixture set twice, once cold and once warm, and report both.
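A minimal sketch of the cold-then-warm run, assuming a `send` callable that submits one fixture to the deployed backend. `FakeBackend` is a stand-in that simulates a lazy load with a short sleep; it is not a platform API, and the fixture names are placeholders.

```python
import time


def timed_run(send, fixtures):
    """Send every fixture once and return per-request latencies in seconds."""
    latencies = []
    for fixture in fixtures:
        start = time.perf_counter()
        send(fixture)
        latencies.append(time.perf_counter() - start)
    return latencies


def cold_and_warm(send, fixtures):
    """Run the fixture set twice; the first pass includes any lazy load."""
    cold = timed_run(send, fixtures)
    warm = timed_run(send, fixtures)
    return {
        "first_request_s": cold[0],
        "cold_pass_max_s": max(cold),
        "warm_pass_max_s": max(warm),
    }


class FakeBackend:
    """Stand-in for a deployed backend: slow on the first call, fast afterwards."""

    def __init__(self):
        self.loaded = False

    def __call__(self, fixture):
        if not self.loaded:
            time.sleep(0.05)  # simulated lazy model load
            self.loaded = True
        return {"echo": fixture}


report = cold_and_warm(FakeBackend(), ["a.jpg", "b.jpg", "c.jpg"])
```

Reporting the first-request latency separately from the warm pass keeps a slow cold load from being averaged away or mistaken for a broken model.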
Version comparison
Use this whenever a candidate artifact is up against an incumbent.
"Is the candidate better than the incumbent" is answered by running the same fixture set against two artifacts in sibling vertex bindings, with the same semantic criteria applied to both. The delta is the evidence the promotion decision needs. A candidate that is good but worse than the incumbent should not be promoted; a candidate that is imperfect but materially better than the incumbent often should.
Pinning is what keeps the comparison honest. Both artifacts have addressable file_ids (or addressable hub ids), the fixture set is addressable, and the backend is the same graph with two different bindings. The artifact itself is the only variable under test.
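The comparison reduces to scoring both artifacts on the same pinned fixtures with the same criterion and reporting the difference. In the sketch below, the two `predict` callables stand in for requests to the two sibling bindings, and exact label match is used as the semantic criterion; both are assumptions for illustration.

```python
def pass_rate(predict, fixtures):
    """Fraction of fixtures whose prediction equals the expected label."""
    passed = sum(1 for inp, expected in fixtures if predict(inp) == expected)
    return passed / len(fixtures)


def compare(incumbent, candidate, fixtures):
    """Score both artifacts on the same fixtures with the same criterion."""
    inc = pass_rate(incumbent, fixtures)
    cand = pass_rate(candidate, fixtures)
    return {"incumbent": inc, "candidate": cand, "delta": cand - inc}


# One frozen fixture set shared by both runs: (input, expected label).
fixtures = [("img1", "cat"), ("img2", "dog"), ("img3", "cat"), ("img4", "bird")]

candidate_answers = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "cat"}


def incumbent(inp):
    return "cat"  # stand-in for the incumbent binding


def candidate(inp):
    return candidate_answers[inp]  # stand-in for the candidate binding


result = compare(incumbent, candidate, fixtures)
```

In this example the candidate is imperfect (it still misses one fixture) but scores higher than the incumbent, which is the situation where promotion is often justified.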
Mismatch risks worth naming
The set of mismatch shapes that survive a clean deploy — wrong labels, wrong tokenizer, wrong preprocessing, wrong task, cold-load-mistaken-for-broken, missing gated-model access — is enumerated in /concepts/models. Run the proof loop against each one that applies to the artifact under test; checking them is the difference between "this model works" and "this model worked on the one example we tried".
Where this fits
Model integration is the longest single workflow in the platform because it does more than test. The artifact enters the workspace, is bound into the right vertex, is wired into a backend that respects its types, is exercised on representative inputs, and is characterized for runtime behavior. Each step is small; skipping one ships a model that underperforms in production for a predictable reason.
The platform makes the loop reproducible: pinned artifacts, pinned components, pinned backends, declarative wiring, evidence captured at each step. The discipline is the team's; the primitives keep it repeatable.
Related
- Models — runtime-path taxonomy in detail.
- File upload and binding — the file-served path.
- File binding — upload-and-bind mechanics reference.
- File schema — the model slot's file_schema declaration.
- Secrets — binding hub access tokens.
- Prove behavior — proof loops to run after integration.
- Test with a live backend — cheapest proof harness for model bindings.
- Release semantics — what promotion of a model-bearing component release actually changes.
- Common failures — model/runtime failure lookup.