Kedro¶

Both Kedro and Apache Hamilton are Python tools that help define directed acyclic graphs (DAGs) of data transformations. While there’s overlap between the two in terms of features, we note two main differences:

  • Kedro is imperative and focuses on tasks; Apache Hamilton is declarative and focuses on assets.

  • Kedro is heavier and comes with a project structure, YAML configs, and dataset definitions to manage; Apache Hamilton is lighter to adopt, and you can progressively opt in to features you find valuable.

On this page, we’ll dive into these differences, compare features, and present some code snippets from both tools.

Note

See this GitHub repository to compare a full project using Kedro or Apache Hamilton.

Imperative vs. Declarative¶

There are 3 steps to building and running a dataflow (a DAG, a data pipeline, etc.):

  1. Define transformation steps

  2. Assemble steps into a dataflow

  3. Execute the dataflow to produce data artifacts (tables, ML models, etc.)

1. Define steps¶

Imperative (Kedro) vs. declarative (Apache Hamilton) leads to significant differences in Step 2 and Step 3 that will shape how you work with the tool. However, Step 1 remains similar. In fact, both tools use the term nodes to refer to steps.

Kedro (imperative)

Apache Hamilton (declarative)

# nodes.py
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies

def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for shuttles."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles

def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame,
) -> pd.DataFrame:
    """Combines all data to create a model input table."""
    shuttles = shuttles.drop("id", axis=1)
    model_input_table = shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table
# dataflow.py
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def companies_preprocessed(companies: pd.DataFrame) -> pd.DataFrame:
    """Companies with added column `iata_approved`"""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies

def shuttles_preprocessed(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Shuttles with added columns `d_check_complete`
    and `moon_clearance_complete`."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles

def model_input_table(
    shuttles_preprocessed: pd.DataFrame,
    companies_preprocessed: pd.DataFrame,
) -> pd.DataFrame:
    """Table containing shuttles and companies data."""
    shuttles_preprocessed = shuttles_preprocessed.drop("id", axis=1)
    model_input_table = shuttles_preprocessed.merge(
        companies_preprocessed, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table

The function implementations are exactly the same. Yet, notice that the function names and docstrings were edited slightly. Imperative approaches like Kedro typically refer to steps as tasks and prefer verbs that describe “the action of the function”. Meanwhile, declarative approaches such as Apache Hamilton describe steps as assets and use nouns that refer to “the value returned by the function”. This might appear superficial, but it relates to the differences in Steps 2 and 3.

2. Assemble dataflow¶

With Kedro, you need to take your functions from Step 1 and create node objects, specifying the node’s name, inputs, and outputs. Then, you create a pipeline from a set of nodes and Kedro assembles the nodes into a DAG. Imperative approaches need to specify how tasks (Kedro nodes) relate to each other.

With Apache Hamilton, you pass the module containing all functions from Step 1 and let Apache Hamilton create the nodes and the dataflow. This is possible because in declarative approaches like Apache Hamilton, each function defines a transform and its dependencies on other functions. Notice how in Step 1, model_input_table() has parameters shuttles_preprocessed and companies_preprocessed, which refer to other functions in the module. These signatures contain all the information required to build the DAG.

Kedro (imperative)

Apache Hamilton (declarative)

# pipeline.py
from kedro.pipeline import Pipeline, node, pipeline
from nodes import (
    create_model_input_table,
    preprocess_companies,
    preprocess_shuttles
)

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=[
                    "preprocessed_shuttles",
                    "preprocessed_companies"
                ],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
# run.py
from hamilton import driver
import dataflow  # module containing node definitions

# pass the module to the `Builder` to create a `Driver`
dr = driver.Builder().with_modules(dataflow).build()
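
Since the Driver now holds the assembled DAG, you can quickly sanity-check what Apache Hamilton discovered. Here is a minimal sketch; the output file name is illustrative:

# inspect.py
from hamilton import driver
import dataflow

dr = driver.Builder().with_modules(dataflow).build()

# print the node names Apache Hamilton discovered from the module
print([var.name for var in dr.list_available_variables()])

# render the full DAG to a file (requires graphviz); the path is illustrative
dr.display_all_functions("dag.png")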

Benefits of adopting a declarative approach

  • Fewer errors since you skip manual node creation (string names are prone to typos).

  • Complexity is easier to handle since assembling a dataflow works the same for 10 or 1,000 nodes.

  • Maintainability improves since editing your functions (Step 1) modifies the structure of your DAG, removing the pipeline definition as a failure point.

  • Readability improves because you can understand how functions relate to each other without jumping between files.

These benefits of Apache Hamilton encourage developers to write smaller functions that are easier to debug and maintain, leading to major code quality gains. Conversely, as projects grow, the burden of node and pipeline creation leads users to stuff more and more logic into a single node, making it increasingly harder to maintain.

3. Execute dataflow¶

The primary way to execute Kedro pipelines is to use the command line tool with kedro run --pipeline=my_pipeline. Pipelines are typically designed so that every node executes, reading data and writing results along the way. In spirit, this is closer to macro-orchestration frameworks like Airflow.

In contrast, Apache Hamilton dataflows are primarily meant to be executed programmatically (i.e., via Python code) and return results in-memory. This makes it easy to use Apache Hamilton within a FastAPI service or to power an LLM application.
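
For instance, here’s a minimal sketch of serving the dataflow behind a FastAPI endpoint (the route, file paths, and response shape are illustrative):

# app.py
import pandas as pd
from fastapi import FastAPI
from hamilton import driver

import dataflow

app = FastAPI()
dr = driver.Builder().with_modules(dataflow).build()

@app.get("/model-input-table")
def model_input_table() -> dict:
    # illustrative inputs; in practice these could come from the request or a database
    inputs = dict(
        companies=pd.read_parquet("path/to/companies.parquet"),
        shuttles=pd.read_parquet("path/to/shuttles.parquet"),
    )
    results = dr.execute(["model_input_table"], inputs=inputs)
    return {"row_count": len(results["model_input_table"])}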

For comparable side-by-side code, we can dig into Kedro and use the SequentialRunner programmatically. To return pipeline results in-memory we would need to hack further with kedro.io.MemoryDataset.

Note

Apache Hamilton also has rich support for I/O operations (see Feature comparison below)
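
As a rough sketch of what this can look like with Apache Hamilton’s materialization API (the exact saver keyword arguments are an assumption here; check the I/O documentation for precise signatures):

# materialize.py
import pandas as pd
from hamilton import driver
from hamilton.io.materialization import to

import dataflow

dr = driver.Builder().with_modules(dataflow).build()
inputs = dict(
    companies=pd.read_parquet("path/to/companies.parquet"),
    shuttles=pd.read_parquet("path/to/shuttles.parquet"),
)
# `id`, `dependencies`, and `path` are assumed keyword arguments for the parquet saver
metadata, _ = dr.materialize(
    to.parquet(
        id="model_input_table__parquet",
        dependencies=["model_input_table"],
        path="./model_input_table.parquet",
    ),
    inputs=inputs,
)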

Kedro (imperative)

Apache Hamilton (declarative)

# run.py
from kedro.runner import SequentialRunner
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from pipeline import create_pipeline
# ^ from Step 2

bootstrap_project(".")
with KedroSession.create() as session:
    context = session.load_context()
    catalog = context.catalog

pipeline = create_pipeline().to_nodes("create_model_input_table_node")
SequentialRunner().run(pipeline, catalog)
# doesn't return values in-memory
# run.py
import pandas as pd
from hamilton import driver
import dataflow

dr = driver.Builder().with_modules(dataflow).build()
# ^ from Step 2
inputs = dict(
    companies=pd.read_parquet("path/to/companies.parquet"),
    shuttles=pd.read_parquet("path/to/shuttles.parquet"),
)
results = dr.execute(["model_input_table"], inputs=inputs)
# results is a dict {"model_input_table": VALUE}

An imperative pipeline like Kedro’s is a series of steps, just like a recipe. The user can specify “from nodes” or “to nodes” to slice the pipeline rather than execute it in full.

With declarative dataflows like Apache Hamilton’s, you request assets/nodes by name (here "model_input_table") and the tool determines which nodes need to execute, avoiding wasteful compute.
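
Continuing the run.py example above, requesting only the preprocessed shuttles executes just that branch of the DAG; the companies data isn’t needed at all:

# only `shuttles_preprocessed` and its dependencies run
subset = dr.execute(
    ["shuttles_preprocessed"],
    inputs=dict(shuttles=pd.read_parquet("path/to/shuttles.parquet")),
)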

The simple Python interface provided by Apache Hamilton allows you to define and execute your dataflow from a single file, which is great for kickstarting an analysis or project. Just use python dataflow.py to execute it!

# dataflow.py
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def companies_preprocessed(companies: pd.DataFrame) -> pd.DataFrame:
    """Companies with added column `iata_approved`"""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies

def shuttles_preprocessed(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Shuttles with added columns `d_check_complete`
    and `moon_clearance_complete`."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles

def model_input_table(
    shuttles_preprocessed: pd.DataFrame,
    companies_preprocessed: pd.DataFrame,
) -> pd.DataFrame:
    """Table containing shuttles and companies data."""
    shuttles_preprocessed = shuttles_preprocessed.drop("id", axis=1)
    model_input_table = shuttles_preprocessed.merge(
        companies_preprocessed, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table

if __name__ == "__main__":
    from hamilton import driver
    import dataflow  # import itself as a module

    dr = driver.Builder().with_modules(dataflow).build()
    inputs=dict(
        companies=pd.read_parquet("path/to/companies.parquet"),
        shuttles=pd.read_parquet("path/to/shuttles.parquet"),
    )
    results = dr.execute(["model_input_table"], inputs=inputs)

Framework weight¶

After imperative vs. declarative, the next largest difference is the type of user experience each provides. Kedro is a more opinionated and heavier framework; Apache Hamilton is on the opposite end of the spectrum and tries to be the lightest library possible. This changes the learning curve, adoption, and how each tool will integrate with your stack.

Kedro¶

Kedro is opinionated and provides clear guardrails on how to do things. To begin using it, you’ll need to learn to:

  • Define nodes and register pipelines

  • Register datasets using the data catalog construct

  • Pass parameters to data runs

  • Configure environment variables and credentials

  • Navigate the project structure

This provides guidance when building your first data pipeline, but it’s also a lot to take in at once. As you’ll see in the project comparison on GitHub, Kedro involves more files, making the project harder to navigate. It also relies on YAML, which is generally seen as an unreliable format. If you have an existing data stack or a favorite library, it might clash with Kedro’s way of doing things (e.g., you already have a credentials management tool; you prefer Hydra for configs).

Apache Hamilton¶

Apache Hamilton attempts to get you started quickly. In fact, this page has pretty much covered what you need to know:

  • Define nodes and a dataflow using regular Python functions (no need to even import hamilton!)

  • Build a Driver with your dataflow module and call .execute() to get results

Apache Hamilton allows you to start light and opt in to features as your project’s requirements evolve (data validation, scaling compute, testing, etc.). Python is a powerful language with rich editor support and tooling, which is why Apache Hamilton advocates for “everything in Python” instead of external configs in YAML or JSON. For example, parameters, data assets, and configurations can live as dataclasses within a .py file. Apache Hamilton was also built with an extensive plugin system: there are many extensions, some contributed by users, to adapt Apache Hamilton to your project, and it’s easy to extend it yourself for further customization.
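
For instance, here’s a minimal sketch of keeping parameters in a plain dataclass instead of a YAML file (the fields and paths are illustrative):

# config.py
import dataclasses

import pandas as pd
from hamilton import driver

import dataflow

@dataclasses.dataclass
class DataflowConfig:
    """Illustrative parameters that would otherwise live in a YAML config."""
    companies_path: str = "path/to/companies.parquet"
    shuttles_path: str = "path/to/shuttles.parquet"

config = DataflowConfig()
dr = driver.Builder().with_modules(dataflow).build()
results = dr.execute(
    ["model_input_table"],
    inputs=dict(
        companies=pd.read_parquet(config.companies_path),
        shuttles=pd.read_parquet(config.shuttles_path),
    ),
)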

In fact, Apache Hamilton is so lightweight, you could even run it inside Kedro!
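
As a rough sketch, a Kedro node can simply delegate its logic to an Apache Hamilton Driver (the node and dataset names below are illustrative):

# hamilton_in_kedro.py
import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline
from hamilton import driver

import dataflow

def build_model_input_table(
    companies: pd.DataFrame, shuttles: pd.DataFrame
) -> pd.DataFrame:
    """Kedro node that delegates the transformations to Apache Hamilton."""
    dr = driver.Builder().with_modules(dataflow).build()
    results = dr.execute(
        ["model_input_table"],
        inputs=dict(companies=companies, shuttles=shuttles),
    )
    return results["model_input_table"]

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=build_model_input_table,
                inputs=["companies", "shuttles"],  # datasets from Kedro's Data Catalog
                outputs="model_input_table",
                name="hamilton_model_input_table_node",
            ),
        ]
    )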

Feature comparison¶

| Trait | Kedro | Apache Hamilton |
| --- | --- | --- |
| Focuses on | Tasks (imperative) | Assets (declarative) |
| Code structure | Opinionated. Makes assumptions about pipeline creation & registration and configuration. | Unopinionated. |
| In-memory execution | Execute using a KedroSession, but returning values in-memory is hacky. | Default |
| I/O execution | Datasets and Data Catalog | Data Savers & Loaders |
| Expressive DAG definition | ⛔ | Function modifiers |
| Column-level transformations | ⛔ | ✅ |
| LLM applications | ⛔ Limited by in-memory execution and return values. | ✅ Declarative in-memory API makes it easy (RAG app). |
| Static DAG visualizations | Needs Kedro Viz installed to export static visualizations. | Visualize the entire dataflow or an execution path, query what’s upstream, etc., directly in a notebook or output to a file (.png, .svg, etc.). Single dependency is graphviz. |
| Interactive DAG viewer | Kedro Viz | Apache Hamilton UI |
| Data validation | Community Pandera plugin | Native and Pandera plugin |
| Executors | Sequential, multiprocessing, multi-threading | Sequential, async, multiprocessing, multi-threading |
| Executor extension | Spark integration | PySpark, Dask, Ray, Modal |
| Dynamic branching | ⛔ | Parallelizable/Collect for easy parallelization |
| Command line tool (CLI) | ✅ | ✅ |
| Node and pipeline testing | ✅ | ✅ |
| Jupyter notebook extensions | ✅ | ✅ |

Both Kedro and Apache Hamilton provide applications to view dataflows/pipelines and interact with their results. Here, Kedro provides a lighter webserver and UI, while Apache Hamilton offers a production-ready containerized application.

| Trait | Kedro Viz | Apache Hamilton UI |
| --- | --- | --- |
| Interactive dataflow viewer | ✅ | ✅ |
| View code definition of nodes | ✅ | ✅ |
| Code versioning | Git SHA (may be out of sync with actual code) | Node-level versioning at runtime |
| Collapsible view | ✅ | ✅ |
| Tag nodes | ✅ | ✅ |
| Execution observability | ⛔ | ✅ |
| Artifact lineage and versioning | ⛔ | ✅ |
| Column-level lineage | ⛔ | ✅ |
| Compare run results | ✅ | ✅ |
| Rich artifact view | Preview 5 dataframe rows. Metadata about artifact (column count, row count, size). | Automatic statistical profiling of various dataframe libraries. |

More information¶

For a full side-by-side example of Kedro and Apache Hamilton, visit this GitHub repository.

For more questions, join our Slack channel.