Data pipeline automation

TL;DR

  • A data pipeline is a backend graph that runs a fixed set of operations over data: an input vertex, processing vertices, an output vertex, connected by typed streams. The same graph that processes a one-off fixture also processes continuous live traffic; what changes between runs is the input source, not the graph.
  • Inputs land in the graph one of three ways: file-backed (upload an artifact, bind it into a vertex), endpoint-pushed (an external caller POSTs into an HTTP input vertex), or source-component (a vertex pulls from a camera, RTSP stream, browser webcam, queue, database, or external API).
  • Outputs leave the graph one of four ways: as an endpoint response (JSON / image / audio frame on a deployed endpoint), as a generated file (written through a vertex's generated_file_schema and saved at deployment teardown), as a replay (deterministic playback evidence), or via a sink component (webhook, MQTT, message bus, external service).
  • Repeatability is the success criterion. A pipeline becomes automation when its inputs are addressable, its graph is pinned to specific component releases, its outputs survive teardown, and its runtime signals are captured. Those four properties are what allow the same run to be reproduced later.
  • Batch, fixture, and live-traffic runs use the same backend primitive. With the components and the graph pinned, a given pipeline produces the same outputs when re-run on the same inputs, because nothing in the runtime is implicit.

Pipelines are backends

A data pipeline in Pipelogic is a backend graph: an input vertex on one side, an output vertex on the other, processing vertices in between, typed streams connecting them, parameters and files bound declaratively, and an operation log carrying the history of how the graph evolved. There is no separate scheduler concept, no separate batch runtime, and no parallel SDK. The graph is the pipeline; the deployment is the pipeline running.

Modeling pipelines this way means everything the platform gives backends applies to pipelines without extra work: type checking before deploy, reproducible operation history, pinned component releases, separable deployments, and lease-based isolated test runs. A team that builds a real-time inference backend already knows how to build a data pipeline — the operating loop and the proof loop are the same.

Mental model

      input source              ─▶  data enters the graph here
      ────────────
        file_id            (upload artifact, bind into a vertex)
        HTTP POST          (an external caller posts in)
        RTSP / camera      (a source component pulls)
        DB / queue / API   (a source component pulls)


    ┌────────────────────── backend graph ─────────────────────┐
    │   ┌──────┐      ┌──────┐      ┌──────┐                   │
    │   │ in   │─────▶│ proc │─────▶│ out  │                   │
    │   └──────┘      └──────┘      └──────┘                   │
    │   vertices wired by typed streams, pinned releases       │
    └───────────────────────────┬──────────────────────────────┘


      output destination         ◀─  results leave the graph here
      ──────────────────
        endpoint response  (JSON / image / audio frame)
        generated file     (saved at deployment teardown)
        replay             (deterministic playback evidence)
        sink               (webhook, MQTT, message bus, …)

The graph stays the same across run modes. Swapping the input vertex for an HTTP ingress versus a file-backed source versus a camera component changes how data enters; swapping the output vertex changes how it leaves. The processing in between — model inference, transformation, aggregation, enrichment — is identical across run modes.
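
As a rough illustration of that claim, the sketch below models a graph in plain Python. The Graph class, the two source functions, and the double step are hypothetical stand-ins invented for this example, not Pipelogic APIs; the only point is that replacing the source leaves the processing step untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator

# Hypothetical model for illustration only; not the Pipelogic API.
@dataclass(frozen=True)
class Graph:
    source: Callable[[], Iterable[dict]]  # input vertex: how data enters
    process: Callable[[dict], dict]       # processing vertices: same in every run mode
    sink: Callable[[dict], None]          # output vertex: how results leave

    def run(self) -> None:
        for record in self.source():
            self.sink(self.process(record))

def fixture_source() -> Iterator[dict]:
    # Stands in for a file-backed fixture.
    yield {"value": 1}
    yield {"value": 2}

def live_source() -> Iterator[dict]:
    # Stands in for a live source component.
    yield {"value": 10}

def double(record: dict) -> dict:
    return {"value": record["value"] * 2}

collected: list[dict] = []
batch = Graph(source=fixture_source, process=double, sink=collected.append)
batch.run()

# Swap only the input; processing and output are carried over unchanged.
live = replace(batch, source=live_source)
live.run()
```

After both runs, collected holds the results of the fixture run followed by the live run, and live.process is the same function object as batch.process.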

See Backends and Solutions for the underlying primitives.

Picking the input mode

Use this step to decide how data enters the graph.

File-backed is the right choice when the input is a static dataset the team owns: a CSV of historical events, a directory of images for a one-shot batch, a JSON snapshot, a parquet table. Upload the file once (typed correctly), bind it into a vertex slot that accepts that type, and the backend reads from it at run time. The same file can drive a one-off batch and a regression fixture without re-uploading.

Endpoint-pushed is the right choice when an external caller controls the data: a client POSTs records over HTTP, a test driver streams fixtures in via WebSocket, an upstream service forwards events to the backend's input URL. The platform mints a forwarding URL per endpoint; tokens bind to (backend, vertex, endpoint) so the URL stays stable across redeploys.
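
A caller-side push might look roughly like the sketch below, which only builds the request and does not send it. The URL, the token value, the JSON body shape, and the use of a Bearer Authorization header are all assumptions made for illustration; the source does not specify how the token is presented, so check the platform's endpoint documentation before relying on any of it.

```python
import json
import urllib.request

def build_push_request(forwarding_url: str, token: str, records: list[dict]) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a batch of records."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        forwarding_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the real mechanism may differ.
            "Authorization": f"Bearer {token}",
        },
    )

# Hypothetical URL and token, used only to show the call shape.
req = build_push_request(
    "https://pipelines.example.invalid/in/hypothetical-endpoint",
    "HYPOTHETICAL_TOKEN",
    [{"event": "click", "ts": 1}],
)
# urllib.request.urlopen(req) would send it to a real endpoint.
```

Because the URL is described as stable across redeploys, a caller built this way would not need to change when the backend is redeployed.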

Source component is the right choice when the backend should pull from a continuous source: a camera, an RTSP stream, a browser webcam, a database, a message queue, an external API. The component owns the pull loop and emits typed records into the graph. This is the common shape for real-time pipelines that have no caller pushing data into them.

The three modes can compose — a graph can have a file-backed historical dataset alongside a live source-component stream, or accept endpoint-pushed events while also reading from a database. The graph topology decides what flows where.

Picking the output mode

Use this step to decide how results leave the graph.

Endpoint response is the right choice when a caller wants the answer back synchronously: a request comes in, the graph processes it, the response is read off the output endpoint's WebSocket. This is the right shape for inference services, transformation APIs, and any pipeline whose client is waiting on the result.

Generated file is the right choice when the pipeline produces artifacts the team wants to keep: a vector index, a derived dataset, a model checkpoint, a set of cropped images. A vertex declares the output in its generated_file_schema; the deployment teardown saves the declared files into the workspace file store. Combined with a lease, this is the shape behind reproducible batch jobs that write artifacts and dispose of everything else.
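
The declare-then-save rule can be pictured with a small sketch. It models a generated_file_schema as a plain set of file names, which is an assumption made for illustration; the real schema format is not described here, and files_saved_at_teardown is a hypothetical helper, not a platform function.

```python
def files_saved_at_teardown(declared: set[str], written: set[str]) -> tuple[set[str], set[str]]:
    """Return (saved, dropped) under the assumed rule that only declared files are kept."""
    saved = declared & written    # declared and actually produced
    dropped = written - declared  # produced but never declared
    return saved, dropped

declared = {"index.bin", "crops.tar"}
written = {"index.bin", "scratch.tmp"}
saved, dropped = files_saved_at_teardown(declared, written)
```

In this example only index.bin is saved; scratch.tmp is dropped because it was never declared, and crops.tar is absent because it was never written.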

Replay is the right choice when the result is visual, audio, or streaming and the value is in being able to play it back deterministically. A replay captures a complete run as a deterministic snapshot; future viewers see the same frames the original viewer saw.

Sink component is the right choice when the result should go somewhere external — a webhook, an MQTT topic, an email, a database, another service. A sink vertex owns the delivery; the rest of the graph stays unaware of the destination's specifics.

The automation loop

The repeatable shape of a pipeline run is: upload or arrange the input, ensure the backend graph performs the operations, bind files / parameters / secrets that the graph needs, deploy (or run inside a lease for ephemeral batches), let the input flow, collect the output, capture runtime signals, record the evidence. Most of this is the standard backend loop; what is specific to data pipelines is the discipline of treating inputs and outputs as addressable artifacts rather than ephemeral state.

A batch run that uses a lease is the cleanest pattern for one-off computations: the lease holds the deployment and any temporary fixtures, the deployment writes generated files into the workspace at teardown, and the kept artifacts are promoted out of the lease before rollback. The inputs were pinned, the runtime was pinned, the outputs are addressable, and the test apparatus has disappeared. The pipeline is reproducible because every piece of it has an identity.
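
The promote-then-rollback shape can be sketched with a context manager. The Lease class, open_lease, and the dictionary used as a workspace are hypothetical stand-ins for illustration and do not reflect the platform's actual lease interface.

```python
from contextlib import contextmanager
from typing import Iterator

class Lease:
    """Hypothetical lease: holds temporary artifacts until rollback."""

    def __init__(self, workspace: dict) -> None:
        self._workspace = workspace
        self.scratch: dict = {}

    def promote(self, name: str) -> None:
        # Move an artifact out of the lease so it survives rollback.
        self._workspace[name] = self.scratch.pop(name)

    def rollback(self) -> None:
        # Everything still inside the lease is discarded.
        self.scratch.clear()

@contextmanager
def open_lease(workspace: dict) -> Iterator[Lease]:
    lease = Lease(workspace)
    try:
        yield lease
    finally:
        lease.rollback()

workspace: dict = {}
with open_lease(workspace) as lease:
    lease.scratch["index.bin"] = b"vector index"
    lease.scratch["fixture.tmp"] = b"temporary fixture"
    lease.promote("index.bin")
```

Rollback runs in the finally clause, so the temporary fixture is discarded even if the batch step raises, while anything promoted beforehand stays in the workspace.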

For continuously running pipelines, the same graph stays deployed; the input source is what changes shape (live HTTP traffic instead of a file, a long-lived RTSP stream instead of a directory of frames). The operating loop is the same as any other production deployment — watch containers, redeploy on version bumps, undeploy when retired.

What "done" means

A pipeline is done when four things are true. The backend graph performs the operations and validates cleanly. The input source is wired (uploaded, endpoint-bound, or sourced by a component). The output is collected somewhere addressable — an endpoint response captured in tests, a generated file saved into the workspace, a replay stored for playback, a sink delivery confirmed. The runtime signals are captured: container logs at least, and ideally any platform-side observability the proof loop later wants to bisect against.

A fifth criterion, optional but recommended, is reusability: the input identity, the backend ID, the component version IDs, and the expected output criteria are all preserved. That collection of pinned identifiers is what turns "we ran the pipeline" into "we have evidence we can re-run any time".
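
One way to keep those identifiers together is a small run manifest, sketched below. The RunManifest class, its field names, the example IDs, and the choice of a SHA-256 digest as the expected-output criterion are all assumptions made for illustration; the platform may record this evidence differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the pinned identifiers for one run."""
    input_id: str
    backend_id: str
    component_versions: tuple[tuple[str, str], ...]  # (component name, version ID)
    expected_output_sha256: str

    def fingerprint(self) -> str:
        # Stable digest of the whole manifest, usable as a run identity.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def output_matches(self, output: bytes) -> bool:
        # Assumed criterion: a re-run must produce byte-identical output.
        return hashlib.sha256(output).hexdigest() == self.expected_output_sha256

output = b'{"count": 3}'
manifest = RunManifest(
    input_id="file-123",        # hypothetical IDs throughout
    backend_id="backend-abc",
    component_versions=(("detector", "v-1"), ("tracker", "v-7")),
    expected_output_sha256=hashlib.sha256(output).hexdigest(),
)
```

A byte-for-byte digest is the strictest possible criterion; pipelines with nondeterministic components would need a looser comparison, such as a tolerance on numeric fields.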

Common failure shapes

Most pipeline failures fall into a small set of shapes, and naming them up front saves debugging time:

  • Graph rejected at edit time — a typed stream mismatch between two vertices. The platform names the edge and the conflict; the fix is usually a transformation between mismatched but compatible types.
  • File-backed input bound but runtime cannot read — upload type does not match what the consuming file_schema accepts, or a Triton-style model repository was pre-tarred instead of uploaded as a directory.
  • Output is empty — the source emitted nothing (check upstream), a transformation filtered everything (check filter predicates), the processing vertex failed silently (read container logs), the sink is not connected (check the graph topology).
  • Generated file missing after teardown — the producing vertex did not declare the file in its generated_file_schema, or the deployment was torn down without saving generated files.
  • Test passes but production data fails — the fixture did not represent production format, rate, encoding, or auth shape. The fix is fixture coverage, not pipeline change.
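
The first failure shape, a typed stream mismatch, can be illustrated with a toy edge check. The Vertex class, the string type names, and find_type_conflicts are hypothetical; the sketch uses strict equality where the platform's real type checker presumably applies richer compatibility rules.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Vertex:
    """Hypothetical vertex with one input type and one output type."""
    accepts: Optional[str]  # None for a pure source
    emits: Optional[str]    # None for a pure sink

def find_type_conflicts(vertices: dict[str, Vertex], edges: list[tuple[str, str]]) -> list[str]:
    """Name every edge whose upstream output type differs from the downstream input type."""
    conflicts = []
    for src, dst in edges:
        emitted, accepted = vertices[src].emits, vertices[dst].accepts
        if emitted != accepted:
            conflicts.append(f"{src} -> {dst}: emits {emitted}, accepts {accepted}")
    return conflicts

vertices = {
    "camera": Vertex(accepts=None, emits="image"),
    "detector": Vertex(accepts="image", emits="detections"),
    "webhook": Vertex(accepts="json", emits=None),
}
edges = [("camera", "detector"), ("detector", "webhook")]
conflicts = find_type_conflicts(vertices, edges)
```

Here the check names the detector-to-webhook edge. Inserting a transformation vertex that accepts detections and emits json between the two would clear the conflict, which mirrors the usual fix described in the list above.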

Broader failure patterns live in Common failures.

Where this fits

Pipelogic models data pipelines as backends so that processing data and serving requests use one system rather than two. The platform's backend features apply to pipelines directly: type safety, reproducibility, isolation through leases, separated deployments, proof loops, and container-level operability. Data engineers coming from bespoke ETL tooling learn the backend vocabulary once and reuse it everywhere.

The discipline that makes pipelines reliable is the same discipline that makes any backend reliable: pin the inputs, pin the components, pin the graph, capture the outputs, keep the runtime signals. The platform provides those primitives; this flow shows how to apply them to data work.
