Dagger
Dagger is a Python library that allows you to:
- Define sophisticated DAGs (directed acyclic graphs) using straightforward, Pythonic code.
- Run those DAGs seamlessly in different runtimes or workflow orchestrators (such as Argo Workflows, Kubeflow Pipelines, and more).
🧰 Features
- Express DAGs succinctly.
- Create dynamic for loops and map-reduce operations.
- Invoke DAGs from other DAGs.
- Run your DAGs locally or using a distributed workflow orchestrator (such as Argo Workflows).
- Take advantage of advanced runtime features (e.g. retry strategies, Kubernetes scheduling directives).
- ... All with a simple Pythonic DSL that feels just like coding regular Python functions.
Dagger also adds no extra dependencies to your project, is reliable (it has 100% test coverage), and comes with thorough documentation and plenty of examples to get you started.
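To give a flavor of what a Pythonic task DSL feels like, here is a minimal, hypothetical sketch. The `task` decorator and function names below are illustrative assumptions, not dagger's actual API; dagger's real DSL captures these calls to build a DAG rather than executing them eagerly.

```python
# Hypothetical sketch: plain functions marked as tasks and composed
# with ordinary function calls (NOT dagger's real API).

def task(fn):
    """Mark a plain Python function as a pipeline task (illustrative)."""
    fn.is_task = True
    return fn

@task
def split_words(text):
    return text.split()

@task
def count_words(words):
    return len(words)

def word_count_pipeline(text):
    # Reads like regular Python code, which is the point of the DSL.
    return count_words(split_words(text))

print(word_count_pipeline("dags made simple"))  # prints 3
```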
🎯 Guiding Principles
Dagger was created to facilitate the implementation and ongoing maintenance of data and ML pipelines.
This goal is reflected in Dagger's architecture and main design decisions:
- To make common use cases and patterns (such as dynamic loops or map-reduce operations) as easy as possible.
- To minimize boilerplate, plumbing and low-level code. With Dagger you don't need to serialize your outputs, store them in a remote file system, and then download and deserialize them again; all of this is done for you.
- To onboard users in just a couple of hours through great documentation, comprehensive examples and tutorials.
- To never sacrifice reliability and performance, and to keep a low memory footprint by using I/O streams and lazy loading where possible.
⛩️ Architecture
Dagger is built around three components:
- A set of core data structures that represent the intended behavior of a DAG.
- A domain-specific language (DSL) that uses metaprogramming to capture how a DAG should behave, and represents it using the core data structures.
- Multiple runtimes that inspect the core data structures to run the corresponding DAG, or prepare the DAG to run in a specific pipeline executor.
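To make the runtime idea concrete, here is a minimal local "runtime" sketch: it reads a DAG described as plain data and executes each node once its dependencies are available. The data layout and the `run_dag` function are illustrative assumptions, not dagger's real internals.

```python
from graphlib import TopologicalSorter

# Hypothetical minimal local runtime (NOT dagger's real implementation).
# A DAG is described as plain data: node name -> (function, input names),
# where an input name may be another node or a DAG parameter.

def run_dag(nodes, params):
    deps = {name: spec[1] for name, spec in nodes.items()}
    results = dict(params)
    # Visit nodes in dependency order; parameters are already available.
    for name in TopologicalSorter(deps).static_order():
        if name in results:
            continue
        fn, inputs = nodes[name]
        results[name] = fn(*(results[i] for i in inputs))
    return results

nodes = {
    "double": (lambda x: 2 * x, ("x",)),
    "square": (lambda y: y * y, ("double",)),
}
print(run_dag(nodes, {"x": 3})["square"])  # prints 36
```

A distributed runtime would do the same inspection, but translate each node into a step of an external executor (such as an Argo Workflows template) instead of calling the function directly.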
The Core Data Structures
Dagger defines DAGs using a series of immutable data structures. These structures are responsible for:
- Exposing all the relevant information so that runtimes can run the DAGs, or translate them into formats supported by other pipeline executors.
- Validating all the pieces of a DAG to catch errors as early as possible in the development lifecycle. For instance:
- Node inputs must not reference outputs that do not exist.
- The DAG must not contain any cycles.
- Nodes and outputs may only be partitioned in ways the library supports.
- etc.
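A sketch of this kind of eager validation, using Python's standard `graphlib` for cycle detection. The `validate` function is a hypothetical illustration of the checks listed above, not dagger's implementation:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical validation pass (NOT dagger's real code): a DAG is a
# mapping of node name -> names of the nodes it depends on.

def validate(nodes):
    # Inputs must not reference outputs that do not exist.
    for name, deps in nodes.items():
        missing = [d for d in deps if d not in nodes]
        if missing:
            raise ValueError(f"node {name!r} references unknown outputs: {missing}")
    # The DAG must not contain any cycles.
    try:
        TopologicalSorter(nodes).prepare()
    except CycleError as e:
        raise ValueError(f"DAG contains a cycle: {e.args[1]}") from e

validate({"a": [], "b": ["a"]})         # valid: no error raised
try:
    validate({"a": ["b"], "b": ["a"]})  # cyclic: rejected at build time
except ValueError as err:
    print(err)
```

Because these checks run when the structures are built, errors surface during development rather than halfway through a pipeline run.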
They are divided into different categories:
- Nodes may be tasks (functions) or DAGs (a series of nodes connected together). DAGs can be nested inside of other DAGs.
- Inputs may come from a DAG parameter or from the output of another node.
- Outputs may be retrieved directly from the return value of a task's function, or from a sub-element of that value (a key or a property).
- Every input/output has a serializer associated with it. The serializer is responsible for turning the value of that input/output into a string of bytes, and a string of bytes back into its original value.
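For example, a serializer in this spirit is just a pair of operations mapping a value to bytes and back. The JSON-based class below is an illustrative assumption, not dagger's serializer interface:

```python
import json

# Hypothetical JSON serializer (illustrative; not dagger's interface).

class AsJSON:
    def serialize(self, value):
        """Turn a value into a string of bytes."""
        return json.dumps(value).encode("utf-8")

    def deserialize(self, data):
        """Turn a string of bytes back into the original value."""
        return json.loads(data.decode("utf-8"))

s = AsJSON()
payload = s.serialize({"words": ["dags", "made", "simple"]})
assert s.deserialize(payload) == {"words": ["dags", "made", "simple"]}
```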
Does it sound interesting? See it in action!