Dagger
Dagger is a Python library that allows you to:
- Define sophisticated DAGs (directed acyclic graphs) using straightforward, Pythonic code.
- Run those DAGs seamlessly in different runtimes or workflow orchestrators (such as Argo Workflows, Kubeflow Pipelines, and more).
🧰 Features
- Express DAGs succinctly.
- Create dynamic for loops and map-reduce operations.
- Invoke DAGs from other DAGs.
- Run your DAGs locally or using a distributed workflow orchestrator (such as Argo Workflows).
- Take advantage of advanced runtime features (e.g. retry strategies, Kubernetes scheduling directives).
- ... All with a simple Pythonic DSL that feels just like coding regular Python functions.
Dagger also adds no extra dependencies to your project, is reliable (it has 100% test coverage), and comes with thorough documentation and plenty of examples to get you started.
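To give a flavor of what a Pythonic task DSL feels like, here is a minimal, hypothetical sketch. The `task` decorator and function names below are illustrative assumptions, not dagger's actual API; dagger's real DSL captures these calls to build a DAG rather than executing them eagerly.

```python
# Hypothetical sketch: plain functions marked as tasks and composed
# with ordinary function calls (NOT dagger's real API).

def task(fn):
    """Mark a plain Python function as a pipeline task (illustrative)."""
    fn.is_task = True
    return fn

@task
def split_words(text):
    return text.split()

@task
def count_words(words):
    return len(words)

def word_count_pipeline(text):
    # Reads like regular Python code, which is the point of the DSL.
    return count_words(split_words(text))

print(word_count_pipeline("dags made simple"))  # prints 3
```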
🎯 Guiding Principles
Dagger was created to facilitate the implementation and ongoing maintenance of data and ML pipelines.
This goal is reflected in Dagger's architecture and main design decisions:
- To make common use cases and patterns (such as dynamic loops or map-reduce operations) as easy as possible.
- To minimize boilerplate, plumbing and low-level code. With Dagger you don't need to serialize your outputs, store them in a remote file system, and then download and deserialize them again; all of this is done for you.
- To onboard users in just a couple of hours through great documentation, comprehensive examples and tutorials.
- To never sacrifice reliability and performance, and to keep a low memory footprint by using I/O streams and lazy loading where possible.
⛩️ Architecture
Dagger is built around three components:
- A set of core data structures that represent the intended behavior of a DAG.
- A domain-specific language (DSL) that uses metaprogramming to capture how a DAG should behave, and represents it using the core data structures.
- Multiple runtimes that inspect the core data structures to run the corresponding DAG, or prepare the DAG to run in a specific pipeline executor.
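To make the runtime idea concrete, here is a minimal local "runtime" sketch: it reads a DAG described as plain data and executes each node once its dependencies are available. The data layout and the `run_dag` function are illustrative assumptions, not dagger's real internals.

```python
from graphlib import TopologicalSorter

# Hypothetical minimal local runtime (NOT dagger's real implementation).
# A DAG is described as plain data: node name -> (function, input names),
# where an input name may be another node or a DAG parameter.

def run_dag(nodes, params):
    deps = {name: spec[1] for name, spec in nodes.items()}
    results = dict(params)
    # Visit nodes in dependency order; parameters are already available.
    for name in TopologicalSorter(deps).static_order():
        if name in results:
            continue
        fn, inputs = nodes[name]
        results[name] = fn(*(results[i] for i in inputs))
    return results

nodes = {
    "double": (lambda x: 2 * x, ("x",)),
    "square": (lambda y: y * y, ("double",)),
}
print(run_dag(nodes, {"x": 3})["square"])  # prints 36
```

A distributed runtime would do the same inspection, but translate each node into a step of an external executor (such as an Argo Workflows template) instead of calling the function directly.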
The Core Data Structures
Dagger defines DAGs using a series of immutable data structures. These structures are responsible for:
- Exposing all the relevant information so that runtimes can run the DAGs, or translate them into formats supported by other pipeline executors.
- Validating all the pieces of a DAG to catch errors as early as possible in the development lifecycle. For instance:
- Node inputs must not reference outputs that do not exist.
- The DAG must not contain any cycles.
- Nodes and outputs may only be partitioned in ways the library supports.
- etc.
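A sketch of this kind of eager validation, using Python's standard `graphlib` for cycle detection. The `validate` function is a hypothetical illustration of the checks listed above, not dagger's implementation:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical validation pass (NOT dagger's real code): a DAG is a
# mapping of node name -> names of the nodes it depends on.

def validate(nodes):
    # Inputs must not reference outputs that do not exist.
    for name, deps in nodes.items():
        missing = [d for d in deps if d not in nodes]
        if missing:
            raise ValueError(f"node {name!r} references unknown outputs: {missing}")
    # The DAG must not contain any cycles.
    try:
        TopologicalSorter(nodes).prepare()
    except CycleError as e:
        raise ValueError(f"DAG contains a cycle: {e.args[1]}") from e

validate({"a": [], "b": ["a"]})         # valid: no error raised
try:
    validate({"a": ["b"], "b": ["a"]})  # cyclic: rejected at build time
except ValueError as err:
    print(err)
```

Because these checks run when the structures are built, errors surface during development rather than halfway through a pipeline run.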
They are divided into different categories:
- Nodes may be tasks (functions) or DAGs (a series of nodes connected together). DAGs can be nested inside of other DAGs.
- Inputs may come from a DAG parameter or from the output of another node.
- Outputs may be retrieved directly from the return value of a task's function, or from a sub-element of that value (a key or a property).
- Every input/output has a serializer associated with it. The serializer is responsible for turning the value of that input/output into a string of bytes, and a string of bytes back into its original value.
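For example, a serializer in this spirit is just a pair of operations mapping a value to bytes and back. The JSON-based class below is an illustrative assumption, not dagger's serializer interface:

```python
import json

# Hypothetical JSON serializer (illustrative; not dagger's interface).

class AsJSON:
    def serialize(self, value):
        """Turn a value into a string of bytes."""
        return json.dumps(value).encode("utf-8")

    def deserialize(self, data):
        """Turn a string of bytes back into the original value."""
        return json.loads(data.decode("utf-8"))

s = AsJSON()
payload = s.serialize({"words": ["dags", "made", "simple"]})
assert s.deserialize(payload) == {"words": ["dags", "made", "simple"]}
```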
Does it sound interesting? See it in action!