Prove backend behavior

Run a representative fixture through the real backend runtime, capture the evidence, and decide pass or fail before you promote.

TL;DR

A proof is a test-and-verify loop: a representative fixture runs through the real backend runtime, produces the expected behavior, and leaves capturable runtime signals. "It compiled" is not a proof.
The loop has the same shape regardless of what you are proving — a new component release, a trained model artifact, a data-pipeline backend, a graph revision: identify the unit under test, freeze the fixtures, run them through the actual runtime, collect outputs and runtime signals, compare against semantic expectations, record the evidence.
A proof pins a backend's operation history so the specific version you tested stays addressable later — "the version we tested yesterday" resolves to an exact graph. Reproducibility comes from the event-sourced operation log replaying to the same graph, against the same pinned releases and the same fixtures.
The platform supplies the primitives that make proof reproducible — pinned releases, deterministic graphs, isolated leases for ephemeral test backends, container logs — and leaves the definition of "passes" to you. You define the expectations; the platform makes them repeatable.
Semantic correctness matters more than schema. A detector that emits valid JSON with the wrong labels is broken; a transcription model that produces well-formed text in the wrong language is broken. Proof is about meaning, not shape.
Proof precedes promotion. The same fixture set runs against the candidate and the incumbent; the delta is the evidence the promotion decision rests on.

Pick a proof mode

Four starting points — open the one that matches what you're proving. Each links to the flow that carries it.

Component or graph change: Live fixture through a lease-backed test backend, with auto-cleanup. The default.
Trained model artifact: Same fixtures against the model-bound backend; isolate the artifact as the variable.
Data pipeline: Representative inputs through the backend; captured outputs become the next baseline.
Is the candidate better? Same fixtures, two pinned versions, one delta: the evidence promotion rests on.

What proof is

A proof is a named, repeatable loop that turns a change — a component release, a model artifact, a data-pipeline backend, a graph revision — into reproducible evidence before that change reaches production or promotion. The output is a recorded result: the unit under test, the fixture, the expected behavior, the observed behavior, the runtime signal, and a conclusion.

Keeping proof in its own loop is what makes the evidence reproducible. Every promotion needs evidence; that evidence has to re-run the same fixtures against the same pinned versions on the same runtime that real traffic will see. A backend's operation history is an event-sourced log, so the graph you tested is addressable after the fact — the log replays to the same graph, and "the version we tested yesterday" points at an exact set of operations rather than a moving target.
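
To illustrate why an event-sourced log keeps the tested graph addressable, the sketch below replays an ordered list of operations into a graph and fingerprints the result. The operation shapes (`add_vertex`, `connect`, `bind`), the IDs, and the `replay` and `fingerprint` helpers are assumptions made for illustration; they are not the platform's actual log format.

```python
import hashlib
import json

def replay(operations):
    """Fold an ordered operation log into a graph state (hypothetical op shapes)."""
    graph = {"vertices": {}, "edges": []}
    for op in operations:
        kind = op["op"]
        if kind == "add_vertex":
            graph["vertices"][op["id"]] = {"release": op["release"], "bindings": {}}
        elif kind == "connect":
            graph["edges"].append([op["src"], op["dst"]])
        elif kind == "bind":
            graph["vertices"][op["vertex"]]["bindings"][op["slot"]] = op["artifact"]
        else:
            raise ValueError(f"unknown operation: {kind}")
    return graph

def fingerprint(graph):
    """Stable hash of a graph; sort_keys makes the JSON encoding canonical."""
    canonical = json.dumps(graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative log; every ID below is made up.
log = [
    {"op": "add_vertex", "id": "ocr", "release": "ocr@1.4.0"},
    {"op": "add_vertex", "id": "detect", "release": "detect@2.1.0"},
    {"op": "connect", "src": "ocr", "dst": "detect"},
    {"op": "bind", "vertex": "detect", "slot": "weights", "artifact": "artifact-123"},
]

tested = fingerprint(replay(log))

# One more operation yields a different, separately addressable graph.
later = log + [
    {"op": "bind", "vertex": "detect", "slot": "weights", "artifact": "artifact-456"}
]
```

Replaying `log` again gives the same fingerprint as `tested`, while `later` gives a different one, which is the property that lets a proof record point at an exact graph rather than at "whatever the backend looks like now".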

Proof is distinct from testing in the unit-test sense. Unit tests run code in isolation; proof runs the actual graph on the actual runtime with representative inputs. A unit test can pass while the integrated system fails, because the test never saw the integration. A proof always sees the integration, because the integrated graph on the real runtime is the only thing it runs.

What counts as evidence

A proof records six things; omitting any of them leaves you with a result, not evidence:

Evidence	What to record	Why it matters
Unit under test	Component version ID, model artifact ID, backend ID (or the operation set that mutated it), fixture identity	"The OCR component" is not specific enough; the released version is
Input	The fixture itself, addressable and immutable	A proof you cannot re-run against the same fixture next quarter is not reproducible
Expected behavior	What the output should mean, not what shape it should have	"Detects faces in the image" is semantic; "returns a JSON array" is only structural
Observed output	The actual response captured from the runtime, including generated files	This is what you compare against the expectation
Runtime signal	Container logs, error lines, latency observations	Without it, cold-start cliffs, intermittent failures, and resource exhaustion stay hidden
Conclusion	Pass / fail / mismatch, with the specific risks called out	"Pass" without a risk note hides the parts of the fixture the run did not exercise

The platform exposes the primitives for all six; assembling them into evidence is the proof loop's job.
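
One way to hold the six pieces of evidence together is a single immutable record plus a completeness check. The sketch below is only that: the `ProofRecord` class, its field names, and the example IDs are hypothetical and do not correspond to a platform type.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass(frozen=True)
class ProofRecord:
    """Hypothetical container for the six pieces of evidence in the table above."""
    unit_under_test: Dict[str, str]   # e.g. component version ID, artifact ID, backend ID
    fixture_id: str                   # addressable, immutable input
    expected: str                     # what the output should mean
    observed: str                     # what the runtime actually returned
    runtime_signal: List[str]         # container log lines, latency notes
    conclusion: str                   # "pass", "fail", or "mismatch"
    risks: List[str] = field(default_factory=list)

REQUIRED = ["unit_under_test", "fixture_id", "expected",
            "observed", "runtime_signal", "conclusion"]

def missing_evidence(record):
    """Return the names of required evidence fields that are empty."""
    data = asdict(record)
    return [name for name in REQUIRED if not data[name]]

# Illustrative record with made-up IDs; the runtime signal was not captured.
record = ProofRecord(
    unit_under_test={"component_version": "ocr@1.5.0-rc1"},
    fixture_id="fixture-invoice-007",
    expected="extracts the invoice total as 1280.00",
    observed="extracted total: 1280.00",
    runtime_signal=[],
    conclusion="pass",
)

gaps = missing_evidence(record)   # the empty runtime signal is reported
```

Here `gaps` names the missing runtime signal, so the run is flagged as a result rather than evidence even though its conclusion says "pass".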

Proof modes in detail

Different starting points produce different proof shapes. The cards above link straight to each flow; the sections below explain when each mode is the right one and what makes its evidence defensible.

Live fixture through a test backend

Use this mode when the unit under test is a component release or a graph change and you want a quick reproducible run with auto-cleanup.

This is the default mode for component authors. A short-lived lease stands up a test backend wrapping the candidate component, runs the fixture, captures outputs and container logs, and tears down. The lease is the isolation boundary: nothing about the test run touches production graphs or production deployments, and the candidate version is exercised exactly the way a real backend would exercise it.

Because the test backend uses the same backend primitives as production, the proof is faithful: typed streams flow through the candidate, the component sees the same containerized environment, the same SDK helpers, the same serving-service dependencies. There is no "this works in the unit test but the runtime is different" gap.
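
The auto-cleanup guarantee can be pictured as a context manager that always releases the lease, even when the fixture run raises. This is a minimal sketch against an in-memory stand-in: `FakePlatform`, `acquire_lease`, `release_lease`, and `leased_backend` are hypothetical names, not the platform's client API.

```python
from contextlib import contextmanager

class FakePlatform:
    """In-memory stand-in for a platform client; every method is hypothetical."""

    def __init__(self):
        self.active_leases = set()
        self._counter = 0

    def acquire_lease(self, component_version):
        self._counter += 1
        lease_id = f"lease-{self._counter}"
        self.active_leases.add(lease_id)
        return lease_id

    def release_lease(self, lease_id):
        self.active_leases.discard(lease_id)

@contextmanager
def leased_backend(platform, component_version):
    """Stand up a test backend under a lease and tear it down on exit."""
    lease_id = platform.acquire_lease(component_version)
    try:
        yield lease_id
    finally:
        # Runs on success and on failure, so no test backend outlives the proof.
        platform.release_lease(lease_id)

platform = FakePlatform()
seen_active = False
try:
    with leased_backend(platform, "ocr@1.5.0-rc1") as lease_id:
        seen_active = lease_id in platform.active_leases
        raise RuntimeError("fixture run failed")
except RuntimeError:
    pass

leftover = set(platform.active_leases)   # empty: the lease was released
```

The design point is that teardown lives in `finally`, so a failing proof cleans up the same way a passing one does.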

See Test with a live backend and Leases.

Same fixtures against the model-bound backend

Use this mode when the unit under test is a model artifact: weights, checkpoint, ONNX file, adapter, or fine-tune.

Model proof is rarely "does it produce JSON". It is "does it produce the right output on the inputs we care about". The loop is: upload the candidate artifact, bind it into the target backend vertex's file slot, run the fixture set, and compare observed outputs against the semantic expectation. When you are upgrading a model, repeat the proof against the previous artifact in the same backend; the delta between the two runs is the evidence for the promotion decision.

The reason this mode is distinct from "live fixture" is that the unit under test is the artifact, not the component. Two model artifacts running through the same component, same graph, same fixture set isolate the artifact as the variable — and that isolation is what makes "this artifact is better than the incumbent" defensible.
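
The gap between "produces JSON" and "produces the right output" can be made concrete with two checks on the same response. The response format (a JSON list of objects with a `label` key) and both helper names are assumptions chosen for illustration.

```python
import json

def is_structurally_valid(raw):
    """Shape-level check: a JSON list of objects that each carry a 'label'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, list) and all(
        isinstance(item, dict) and "label" in item for item in data
    )

def is_semantically_correct(raw, expected_labels):
    """Meaning-level check: the detected labels match the expectation."""
    if not is_structurally_valid(raw):
        return False
    observed = {item["label"] for item in json.loads(raw)}
    return observed == set(expected_labels)

# Made-up detector response for a fixture that contains a face.
response = '[{"label": "cat", "score": 0.97}]'

shape_ok = is_structurally_valid(response)
meaning_ok = is_semantically_correct(response, ["face"])
```

The response passes the structural check and fails the semantic one, which is exactly the case a schema-only test would let through.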

See Model artifact integration and Models.

Repeatable data fixture through the backend

Use this mode when the unit under test is a data pipeline.

Data-pipeline proof is fixture-coverage proof: representative inputs are run through the backend, generated outputs are captured, and the run is treated as the next regression baseline. Because the backend is a typed real-time graph, the proof loop does not need a separate batch-processing harness — the same backend that runs scheduled traffic in production runs the fixture in proof.

The reason this is its own mode is the artifact-capture habit: pipeline proofs are most useful when the generated outputs survive teardown and become the next run's expected outputs. Deployment teardown can explicitly save generated files into the workspace file store, which is what makes pipeline proof reproducible across versions of the underlying components.
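
Treating captured outputs as the next baseline amounts to diffing one run's generated files against the previous run's. The sketch below does this with content hashes over two local directories; the directory layout and helper names are hypothetical, and byte-exact comparison is itself an assumption, since some pipelines need a semantic comparison instead.

```python
import hashlib
import tempfile
from pathlib import Path

def digest_tree(root):
    """Map each file's relative path to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def diff_against_baseline(baseline, observed):
    """Report paths added, removed, or changed relative to the baseline."""
    shared = set(baseline) & set(observed)
    return {
        "added": sorted(set(observed) - set(baseline)),
        "removed": sorted(set(baseline) - set(observed)),
        "changed": sorted(p for p in shared if baseline[p] != observed[p]),
    }

with tempfile.TemporaryDirectory() as tmp:
    base, run = Path(tmp, "baseline"), Path(tmp, "run")
    base.mkdir()
    run.mkdir()
    # Made-up outputs: one file changed, one file is new in this run.
    (base / "totals.csv").write_text("id,total\n1,10\n")
    (run / "totals.csv").write_text("id,total\n1,12\n")
    (run / "rejects.csv").write_text("id\n2\n")
    report = diff_against_baseline(digest_tree(base), digest_tree(run))
```

An empty report means the run reproduces the baseline; a non-empty one is the mismatch to explain before the run's outputs are accepted as the new baseline.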

See Data pipeline automation and File upload and binding.

Version comparison

Use this mode when the promotion question is "is the candidate better than the incumbent".

Version comparison is the same fixture set, run twice, against two pinned units (two component versions, two model artifacts, two backend revisions), with the same semantic criteria applied to both observed outputs. The result is the delta, not either run in isolation. A candidate that is "good" but worse than the incumbent should not be promoted; a candidate that is "imperfect" but materially better than the incumbent often should.

The platform makes version comparison cheap because everything is pinned and immutable. The fixture is addressable; the two units under test are addressable; the runtime is the same backend graph with one binding swapped. There is no "did we accidentally compare against a different fixture" failure mode.
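
A version comparison reduces to scoring both pinned units on the same fixtures with the same criterion and reporting the difference. In this sketch the units are plain callables and the scorer is exact match; the fixture IDs, `run_fixtures`, and `compare` are illustrative assumptions rather than platform features.

```python
def run_fixtures(unit, fixtures, score):
    """Score one pinned unit on every fixture; `unit` is any callable."""
    return {
        fixture_id: score(unit(fixture["input"]), fixture["expected"])
        for fixture_id, fixture in fixtures.items()
    }

def compare(candidate, incumbent, fixtures, score):
    """Apply the same fixtures and scorer to both units and report the delta."""
    cand = run_fixtures(candidate, fixtures, score)
    inc = run_fixtures(incumbent, fixtures, score)
    delta = {fixture_id: cand[fixture_id] - inc[fixture_id] for fixture_id in fixtures}
    return {
        "regressions": sorted(k for k, d in delta.items() if d < 0),
        "improvements": sorted(k for k, d in delta.items() if d > 0),
        "mean_delta": sum(delta.values()) / len(delta),
    }

def exact_match(observed, expected):
    return 1.0 if observed == expected else 0.0

# Toy fixtures and units standing in for two pinned versions.
fixtures = {
    "en-short": {"input": "hello", "expected": "HELLO"},
    "en-mixed": {"input": "Hi There", "expected": "HI THERE"},
    "digits": {"input": "a1", "expected": "A1"},
}

def incumbent(text):
    return text.upper() if text.islower() else text

def candidate(text):
    return text if text[-1].isdigit() else text.upper()

result = compare(candidate, incumbent, fixtures, exact_match)
```

In this toy case the candidate fixes `en-mixed` and breaks `digits`, so the mean delta is zero while the per-fixture view shows a regression; reporting both keeps an average from hiding the subset that got worse.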

See Release semantics.

The proof loop in motion

The minimum sequence is short and looks the same across modes:

Identify exactly what is under test (version IDs, artifact IDs, backend ID).
Freeze the fixture set so the same inputs can be re-run later.
Run the fixture through the actual runtime — a lease-backed test backend for component / graph proof, a deployed backend for model / pipeline proof.
Wait until every vertex container reports running before sending input; URLs can resolve before the underlying containers are listening.
Send the fixture, collect the observed output (including any generated files), capture the container logs that accompany it.
Compare observed behavior to semantic expectation.
Record the evidence — IDs, fixture, expected, observed, runtime signal, conclusion, mismatch risks.
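
The readiness wait in the sequence above can be sketched as a polling loop with a deadline. The status strings, the `get_statuses` callable, and the default timings are assumptions; how container status is actually read depends on the platform's own interface.

```python
import time

def wait_until_running(get_statuses, timeout_s=30.0, poll_s=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until every vertex container reports 'running', or raise on timeout.

    `get_statuses` is a hypothetical callable returning {vertex_id: status}.
    """
    deadline = clock() + timeout_s
    while True:
        statuses = get_statuses()
        if statuses and all(status == "running" for status in statuses.values()):
            return statuses
        if clock() >= deadline:
            pending = sorted(v for v, s in statuses.items() if s != "running")
            raise TimeoutError(f"vertices not running: {pending}")
        sleep(poll_s)

# Simulated status snapshots: one vertex is still starting on the first poll.
snapshots = iter([
    {"ocr": "starting", "detect": "running"},
    {"ocr": "running", "detect": "running"},
])
ready = wait_until_running(lambda: next(snapshots), sleep=lambda _seconds: None)
```

Only after this returns would the loop send the fixture; a timeout is itself a runtime signal worth recording, since it names the vertices that never came up.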

When the runtime misbehaves, container logs are the source of truth. The platform deliberately does not synthesize a unified "this is what went wrong" surface; it exposes container output verbatim and lets the proof loop decide what counts as a failure signal.
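
Because the proof loop decides what counts as a failure signal, that decision can live in a small set of named patterns applied to the verbatim log lines. The patterns and sample log lines below are invented for illustration; real signals depend on what your containers actually print.

```python
import re

# Hypothetical failure signals; tune these to your own containers' output.
FAILURE_PATTERNS = {
    "oom": re.compile(r"out of memory|OOMKilled", re.IGNORECASE),
    "traceback": re.compile(r"^Traceback \(most recent call last\)"),
    "error_level": re.compile(r"\bERROR\b"),
}

def scan_logs(lines, patterns=FAILURE_PATTERNS):
    """Return (line_number, signal_name, line) for every pattern match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits.append((number, name, line))
    return hits

# Invented sample lines standing in for captured container output.
logs = [
    "INFO model loaded",
    "ERROR worker exited: out of memory",
    "INFO request served",
]
hits = scan_logs(logs)
signals = sorted({name for _number, name, _line in hits})
```

Keeping the matched lines alongside the signal names means the evidence record still carries the verbatim output, not only the interpretation.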

Why proof gates promotion

Promotion (turning a prerelease into a released version, swapping a model artifact into production, pointing real traffic at a new backend revision) is the step where a change starts serving real traffic. It is reversible, but a revert only happens after someone has noticed the regression, by which point real traffic has already been served by it.

Proof gates promotion because the other checks cannot answer the questions promotion depends on: code review cannot tell you whether a model regressed on a hard subset of inputs; a passing build cannot tell you whether the new component's container hangs on cold start; a manual smoke test cannot tell you whether a schema-correct response still means the same thing. An evidence-bearing proof loop can.

The platform keeps promotion as an explicit, separate command — never implicit in publish, never implicit in deploy. The promotion call is where a human attaches their judgement to the evidence the proof loop produced.

See also

Test with a live backend — the default test-harness mode.
Leases — isolation boundary for ephemeral test backends.
Model artifact integration — model-bound proof loop.
Data pipeline automation — pipeline proof loop.
Release semantics — what promotion actually changes.
Deploy and monitor — runtime signal capture during proof.
Common failures — symptom → fix lookup.