Dagger

Dagger is a Python library that allows you to:

  • Define sophisticated DAGs (directed acyclic graphs) using straightforward, Pythonic code.
  • Run those DAGs seamlessly in different runtimes or workflow orchestrators (such as Argo Workflows, Kubeflow Pipelines, and more).
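
For a taste of what that looks like, here is a minimal sketch of a DAG defined and run locally. The specific names (dsl.task, dsl.DAG, dsl.build and the local runtime's invoke) are assumptions based on this overview; check the documentation for the exact API.

```python
# A minimal sketch, assuming the dsl decorators and the local runtime
# described in this overview; exact names may differ in the docs.
from dagger import dsl
from dagger.runtime.local import invoke


@dsl.task()
def double(number):
    return number * 2


@dsl.DAG()
def pipeline(number):
    # Inside a DAG-decorated function, calling a task records a node
    # in the graph instead of running the function right away.
    return double(number)


# Build the core data structures from the DSL and run them locally.
dag = dsl.build(pipeline)
print(invoke(dag, params={"number": 3}))
```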

🧰 Features

  • Express DAGs succinctly.
  • Create dynamic for loops and map-reduce operations.
  • Invoke DAGs from other DAGs.
  • Run your DAGs locally or using a distributed workflow orchestrator (such as Argo Workflows).
  • Take advantage of advanced runtime features (e.g., retry strategies and Kubernetes scheduling directives).
  • ... All with a simple Pythonic DSL that feels just like coding regular Python functions.

Dagger also adds no extra dependencies to your project, is reliable (with 100% test coverage), and comes with thorough documentation and plenty of examples to get you started.
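
To make the dynamic loops and map-reduce features concrete, here is a hedged sketch using the same assumed DSL as above: the list comprehension fans out one square invocation per partition produced at runtime, and add fans the results back in.

```python
# A sketch of a dynamic map-reduce, under the same API assumptions
# as the previous example.
from dagger import dsl


@dsl.task()
def split():
    # The number of partitions is only known at runtime.
    return [1, 2, 3]


@dsl.task()
def square(n):
    return n ** 2


@dsl.task()
def add(numbers):
    return sum(numbers)


@dsl.DAG()
def map_reduce():
    # Fan out over the runtime partitions (map), then fan back in (reduce).
    return add([square(n) for n in split()])
```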

🎯 Guiding Principles

Dagger was created to facilitate the implementation and ongoing maintenance of data and ML pipelines.

This goal is reflected in Dagger's architecture and main design decisions:

  • To make common use cases and patterns (such as dynamic loops or map-reduce operations) as easy as possible.
  • To minimize boilerplate, plumbing, and low-level code (with Dagger, you don't need to serialize your outputs, store them in a remote file system, and later download and deserialize them again; all of this is done for you).
  • To onboard users in just a couple of hours through great documentation, comprehensive examples and tutorials.
  • To never sacrifice reliability and performance, and to keep a low memory footprint by using I/O streams and lazy loading where possible.

⛩️ Architecture

Dagger is built around three components:

  • A set of core data structures that represent the intended behavior of a DAG.
  • A domain-specific language (DSL) that uses metaprogramming to capture how a DAG should behave, and represents it using the core data structures.
  • Multiple runtimes that inspect the core data structures to run the corresponding DAG, or prepare the DAG to run in a specific pipeline executor.

[Diagram: the components of Dagger]

The Core Data Structures

Dagger defines DAGs using a series of immutable data structures. These structures are responsible for:

  • Exposing all the relevant information so that runtimes can run the DAGs, or translate them into formats supported by other pipeline executors.
  • Validating all the pieces of a DAG to catch errors as early as possible in the development lifecycle. For instance:
    • Node inputs must not reference outputs that do not exist.
    • The DAG must not contain any cycles.
    • Nodes and outputs may only be partitioned in ways the library supports.
    • etc.
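
To illustrate the first of those rules, the sketch below wires a task's input to the output of a node that doesn't exist; under the validations described here, constructing the DAG should fail immediately, long before anything runs. The module and class names (DAG, Task, FromNodeOutput) are assumptions based on the categories described below, so read this as pseudocode against the real API.

```python
# Hypothetical module and class names; see the categories below.
from dagger.dag import DAG
from dagger.input import FromNodeOutput
from dagger.task import Task

# "consume" references an output of a node named "produce", but no such
# node exists, so the DAG constructor is expected to raise a validation
# error right here, at definition time.
DAG(
    nodes={
        "consume": Task(
            lambda x: x,
            inputs={"x": FromNodeOutput("produce", "result")},
        ),
    },
)
```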

They are divided into different categories:

[Diagram: the categories of core data structures]

  • Nodes may be tasks (functions) or DAGs (a series of nodes connected together). DAGs can be nested inside of other DAGs.
  • Inputs may come from a DAG parameter or from the output of another node.
  • Outputs may be retrieved directly from the return value of a task's function, or from a sub-element of that value (a key or a property).
  • Every input/output has a serializer associated with it. The serializer is responsible for turning the value of that input/output into a string of bytes, and for turning that string of bytes back into the original value.
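
Because every input/output carries a serializer, controlling how values are encoded is a matter of supplying an object with the right shape. The sketch below captures the concept; the method names and the stream-based signature are assumptions, not Dagger's documented protocol.

```python
import io
import json


class AsJSON:
    """A sketch of a serializer: values to bytes and back.

    The serialize/deserialize names and stream-based signature are
    assumptions; check the library docs for the actual protocol.
    """

    extension = "json"

    def serialize(self, value, writer):
        # Turn the value into a string of bytes, writing to a stream so
        # large values never need to live in memory all at once.
        writer.write(json.dumps(value).encode("utf-8"))

    def deserialize(self, reader):
        # Turn a string of bytes back into the original value.
        return json.loads(reader.read().decode("utf-8"))


# Round-trip check.
buffer = io.BytesIO()
AsJSON().serialize({"answer": 42}, buffer)
buffer.seek(0)
assert AsJSON().deserialize(buffer) == {"answer": 42}
```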

Does it sound interesting? See it in action!
