Choosing a Simple and Free Orchestrator for Kedro Pipelines on AWS

Hello Team!
So it's been a few months since we started using Kedro, and it's time to deploy some of the pipelines we have created.
We need to choose an orchestrator, but this isn't our field of expertise, so I wanted to ask for some help. We would like something simple to set up and use collaboratively. My company also requires that it be free (at least for now), our cloud provider is AWS, and we already use MLflow. Here are the alternatives we found:

  • Prefect (open-source, seems nice to use, kedro support, but free tier imposes limitations)
  • Flyte (free?, open-source, seems nice to use, no kedro support)
  • MLRun (free and open-source, no kedro support? seems nice to use but a bit more than an orchestrator, requires python 3.9)
  • Kubeflow Pipelines (free and open-source, kedro plugin, and others seem to think it is complex to setup and maintain)
  • Airflow (free and open-source, kedro plugin)
  • SageMaker (Amazon, Kedro plugin, though I personally dislike its UI and how other AWS services are organized around it)

What would you recommend? What should we consider to make such a decision?

Thanks for your help :)

hello!

disclaimer: I'm not an expert in any of these tools (alternative way of saying "it depends" 🙂)

when you talk about the limitations of the various free tiers, I guess those apply to the corresponding Cloud/Hosted option, right? taking the example of Prefect, to the best of my knowledge Prefect OSS doesn't have any limitations. the free tier of Prefect Cloud does, though (max of 5 000 runs/day). I guess something like that applies to Flyte vs Union, or Airflow vs Amazon MWAA (yes, AWS offers a managed Airflow service).

if you intend to *operate* the orchestrator yourself, then you're free to choose from the different OSS options. what do you want out of an orchestrator? given that your business logic (hence CPU-bound tasks) will live in the Kedro pipelines themselves, probably you'll want to pick a simple orchestrator that dispatches tasks, centralizes logs, displays execution status (and in my personal opinion, you don't need Kubernetes for that).

tl;dr: think carefully whether you want to operate your orchestrator yourself, or use some managed service.

then, you need to think about *where* your pipelines will run. considering that your cloud provider is AWS, you'll be looking at Amazon EC2, Amazon ECS, etc. definitely *not* the same hardware where your orchestrator lives, otherwise you risk taking it down accidentally!

again taking Prefect as an example, looks like prefect-aws allows you to deploy your flows on ECS.

the most tried and tested orchestrator out there is Airflow, and there's an official Kedro plugin for it. but building Kedro translators isn't really a terribly difficult task, just see the code snippet in our docs that translates pipelines to Prefect for example.

I'll let others comment on their specific experiences 👂

Hi and thanks for your reply!

to the best of my knowledge Prefect OSS doesn't have any limitations. the free tier of Prefect Cloud does, though (max of 5 000 runs/day)
I missed that, thanks for pointing it out!

if you intend to *operate* the orchestrator yourself, then you're free to choose from the different OSS options. what do you want out of an orchestrator? given that your business logic (hence CPU-bound tasks) will live in the Kedro pipelines themselves, probably you'll want to pick a simple orchestrator that dispatches tasks, centralizes logs, displays execution status (and in my personal opinion, you don't need Kubernetes for that).
Indeed, I think this is what I need. Also, I wonder if it could allow me to:
  • Choose whether I run a set of nodes on a single machine or each node on a different machine (and if I am asking myself this question, does it mean I split my pipelines/nodes wrongly?)
  • Choose the target machine for each node (e.g. some tasks I would like to run on a small or big EC2 instance, others on a Dask cluster)

For now, I'll have a deeper look at the Prefect and Airflow documentation, thanks! :)

Dagster is another one not on your list.

In general, building an orchestrator integration isn't *that* bad; they all look kinda similar. If you want to go with Flyte or something, it shouldn't be that hard.

I'm kind of bought into the view Dagster pushes, that a data orchestrator should be asset-oriented (like Kedro is also asset-oriented), and not task-oriented (e.g. Airflow). You can read more here: https://dagster.io/blog/impedance-mismatch-in-data-orchestration

(There are other asset-oriented orchestrators, but I would need to find a list of which is which to not make a mistake :P)

Hi!

Thank you for your reply. I was looking at the bridge script between kedro and prefect in the docs as suggested and it does seem quite straightforward. It's also a very transparent way to use an orchestrator from kedro.

I had a look at dagster and watched the talk by Pete Hunt and it does make a lot of sense. I will have a better look, thanks for sharing!

I am having quite a lot of fun playing with dagster and the concept of asset-oriented orchestration. But I am not sure why you say Kedro is asset-oriented?

I still have a naive understanding, but it seems to me that a practical difference when you have an asset-oriented orchestrator is that running a job is about materializing data assets.

In Kedro, we run nodes, which to me means it is task-oriented. Am I thinking about it wrongly?

this "Principal Components Analysis" of orchestrators also considers Kedro an asset-based micro-orchestrator https://www.run.house/blog/lean-data-automation-a-principal-components-approach

in Kedro we run nodes, yes, but many times they end up calling a dataset's .save() method, so I definitely agree with the asset-oriented nature.

I think the automatic arrangement of the DAG makes Kedro a poor task-based orchestrator (every time somebody tries to use it that way, they have to create "dummy datasets" to keep control over task ordering)

Interesting read, thank you. I see your point. Kedro is structured around a data catalog, has built-in load/save logic at the node level, and automatically generates a DAG based on data lineage, so it is asset-driven, even though we can't do

kedro materialize --assets model_input_table
and instead have to do
kedro run --pipeline data_processing

BTW dagster does not come with a data catalog (apparently that's in Dagster+), so it may be even more interesting to combine it with kedro.

about Flyte: I have a working prototype of kedro-flyte, and I hope we will be able to open-source it in Nov/Dec this year.

Thanks, it would be nice to be able to try it out. May I ask what would make you choose Flyte over, say, Prefect for your Kedro pipeline orchestration?

The main reason is the execution layer: Prefect is mainly an orchestrator, while Flyte is both an orchestrator and an execution layer, i.e. it natively supports distributed execution. Prefect does not.

I see. If I understood properly, Prefect seems to have a task-runner concept that can make use of external tools like Dask or Ray, while Flyte is Kubernetes-native, so you get distributed execution natively (while still having access to Dask, Ray, and Spark through OSS Kubernetes operators).

One painful point for me is Kubernetes: I have only a basic understanding of how it works and not much time to focus on learning it. Even though I would get some (limited) support from a DevOps engineer in my company who has Kubernetes expertise, I am wondering if I should choose an orchestrator that relies on a technology I don't understand. If any of you have advice or experience with being in this situation, I would really appreciate it. 🙂

The question to answer is: what is your use case for the orchestrator?
Is it (1) for scheduling Kedro pipelines, or (2) for scaling up Kedro pipelines?
For 1, Prefect is a good option; for 2, IMHO it is not.

Got it, thanks for your help 🙂

FYI, after some careful consideration, we are starting to move forward with dagster.

I have made a small repo with the translation logic from kedro to dagster: kedro-spaceflights-dagster (see here for the logic). There's still room for improvement, and I plan to keep this repo updated as I go deeper into the capabilities of dagster. If there is any interest on your side, I'd be happy to open a PR to add this to the kedro deployment options.

thanks a lot for sharing! how was the experience of building the code that translates Kedro into Dagster assets?

It was straightforward, but I had to make some choices, as dagster has different options for defining a generic task. I replied with more details on the spike issue.

I will keep looking into better integrating kedro with dagster. While I have a translation script, it is not sufficient to actually use dagster in production (e.g. I can't assign execution targets to nodes or pipelines atm). Also, I believe there are other concepts in dagster that could be mapped to kedro or handled with kedro's OmegaConf setup.

Very exciting! I haven't actually used Dagster myself, but this was something I was hoping to explore, so glad you beat me to it. I'll definitely take a look later. :)

Thanks for recommending it. :)

I would really appreciate any feedback, thank you!
