Hello Team!
So it's been a few months since we started using kedro, and it's time to deploy some of the pipelines we have created.
We need to choose an orchestrator, but this is not our field of expertise, so I wanted to ask for some help. We would like something simple to set up and use collaboratively. Also, my company requires that it be free (at least for now), our cloud provider is AWS, and we already use mlflow. Here are the alternatives we found:
hello !
disclaimer: I'm not an expert in any of these tools (alternative way of saying "it depends" 🙂)
when you talk about the limitations of the various free tiers, I guess those apply to the corresponding Cloud/Hosted option, right? taking the example of Prefect, to the best of my knowledge Prefect OSS doesn't have any limitations. the free tier of Prefect Cloud does, though (max of 5 000 runs/day). I guess something like that applies to Flyte vs Union, or Airflow vs Amazon MWAA (yes, AWS offers a managed Airflow service).
if you intend to <i>operate</i> the orchestrator yourself, then you're free to choose from the different OSS options. what do you want out of an orchestrator? given that your business logic (hence CPU-bound tasks) will live in the Kedro pipelines themselves, probably you'll want to pick a simple orchestrator that dispatches tasks, centralizes logs, displays execution status (and in my personal opinion, you don't need Kubernetes for that).
tl;dr: think carefully whether you want to operate your orchestrator yourself, or use some managed service.
then, you need to think <i>where</i> your pipelines will run. considering that your cloud provider is AWS, you'll be looking at Amazon EC2, Amazon ECS, etc. definitely <i>not</i> the same hardware where your orchestrator lives, otherwise you risk taking it down accidentally!
again taking Prefect as an example, looks like prefect-aws allows you to deploy your flows on ECS.
the most tried and tested orchestrator out there is Airflow, and there's an official Kedro plugin for it. but building Kedro translators isn't really a terribly difficult task, just see the code snippet in our docs that translates pipelines to Prefect for example.
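to give a feel for why building a translator isn't hard: stripped of any specific orchestrator API, the core of every one of these translators is just "walk the pipeline's DAG in dependency order and register one task per node". Below is a minimal pure-Python sketch of that core, with made-up node and dataset names standing in for what a real translator would read from `pipeline.nodes`:

```python
# Hypothetical stand-in for Kedro nodes: a name plus the dataset names it
# consumes and produces. A real translator would read these from the
# pipeline object instead of hard-coding them.
NODES = {
    "preprocess": {"inputs": ["raw"], "outputs": ["clean"]},
    "features": {"inputs": ["clean"], "outputs": ["features"]},
    "train": {"inputs": ["features"], "outputs": ["model"]},
}

def execution_layers(nodes):
    """Group nodes into layers that can run in parallel: a node is ready
    once every dataset it consumes has been produced by an earlier layer."""
    # free inputs (e.g. raw data already on disk) count as already produced
    produced = {ds for spec in nodes.values() for ds in spec["inputs"]} - {
        ds for spec in nodes.values() for ds in spec["outputs"]
    }
    remaining = dict(nodes)
    layers = []
    while remaining:
        ready = [n for n, s in remaining.items() if set(s["inputs"]) <= produced]
        if not ready:
            raise ValueError("cycle or missing dataset in the pipeline")
        layers.append(sorted(ready))
        for n in ready:
            produced |= set(remaining.pop(n)["outputs"])
    return layers

print(execution_layers(NODES))  # [['preprocess'], ['features'], ['train']]
```

the orchestrator-specific part is then just wrapping each node in the target tool's task decorator, which is what the docs snippet does for Prefect.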
I'll let others comment on their specific experiences 👂
Hi and thanks for your reply!
> to the best of my knowledge Prefect OSS doesn't have any limitations. the free tier of Prefect Cloud does, though (max of 5 000 runs/day)
I missed that, thanks for pointing it out!
> if you intend to <i>operate</i> the orchestrator yourself, then you're free to choose from the different OSS options. what do you want out of an orchestrator? given that your business logic (hence CPU-bound tasks) will live in the Kedro pipelines themselves, probably you'll want to pick a simple orchestrator that dispatches tasks, centralizes logs, displays execution status (and in my personal opinion, you don't need Kubernetes for that).
Indeed, I think this is what I need. Also, I wonder if it could allow me to:
prefect and airflow documentation, thanks! :)
Dagster is another one not on your list.
In general, building an orchestrator integration isn't <i>that</i> bad; they all look kinda similar. If you want to go with Flyte or something, it shouldn't be that hard.
I'm kind of bought into the view Dagster pushes, that a data orchestrator should be asset-oriented (like Kedro is also asset-oriented), and not task-oriented (e.g. Airflow). You can read more here: https://dagster.io/blog/impedance-mismatch-in-data-orchestration
(There are other asset-oriented orchestrators, but I would need to find a list of which is which to not make a mistake :P)
Hi !
Thank you for your reply. I was looking at the bridge script between kedro and prefect in the docs as suggested and it does seem quite straightforward. It's also a very transparent way to use an orchestrator from kedro.
I had a look at dagster and watched the talk by Pete Hunt, and it does make a lot of sense. I will have a better look, thanks for sharing!
I am having quite a lot of fun playing with dagster and the concept of asset-oriented orchestration. But I am not sure why you say Kedro is asset-oriented?
I still have a naive understanding, but it seems to me that a practical difference when you have an asset-oriented orchestrator is that running a job is about materializing data assets.
In Kedro, we run nodes, which to me means it is task-oriented. Am I thinking about it wrongly?
this "Principal Components Analysis" of orchestrators also considers Kedro an asset-based micro-orchestrator https://www.run.house/blog/lean-data-automation-a-principal-components-approach
in Kedro we run nodes, yes, but those many times end up calling a dataset's .save() method, so I definitely agree with the asset-oriented nature.
I think the automatic arrangement of the DAG makes Kedro a poor task-based orchestrator (every time somebody tries that, they have to create "dummy datasets" to have control over said tasks)
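one way to see the asset-oriented nature concretely: in Kedro you never declare an edge between two nodes; every edge falls out of which node produces a dataset and which node consumes it. A tiny sketch of that inference, with invented node and dataset names:

```python
# Sketch of why Kedro reads as asset-oriented: node-to-node edges are never
# declared explicitly, they are derived from shared dataset names.
# All names below are made up for illustration.
nodes = {
    "split": {"inputs": ["model_input_table"], "outputs": ["train_set", "test_set"]},
    "fit": {"inputs": ["train_set"], "outputs": ["model"]},
    "evaluate": {"inputs": ["model", "test_set"], "outputs": ["metrics"]},
}

def edges(nodes):
    """Derive (producer, consumer) edges from dataset names alone."""
    producer = {ds: n for n, s in nodes.items() for ds in s["outputs"]}
    return sorted(
        (producer[ds], n)
        for n, s in nodes.items()
        for ds in s["inputs"]
        if ds in producer  # datasets with no producer are free inputs
    )

print(edges(nodes))
# [('fit', 'evaluate'), ('split', 'evaluate'), ('split', 'fit')]
```

this is also why "dummy datasets" show up when people try to force pure task ordering: with no shared dataset, there is no edge to infer.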
Interesting read, thank you. I see your point. Kedro is structured around a data catalog, has built-in load/save logic at the node level, and automatically generates a DAG based on data lineage, therefore it is asset-driven. Even if we can't do kedro materialize --assets model_input_table and instead have to do kedro run --pipeline data_processing.
BTW dagster does not come with a data catalog (it is in dagster+ apparently), so it may be even more interesting to combine it with kedro.
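on the "materialize an asset" point: in a Kedro-like model that operation is essentially a backward walk from the target dataset to every node needed to produce it (roughly what kedro run with --to-outputs gives you today). A small sketch, with invented names:

```python
# Hedged sketch: "materializing an asset" = select the subgraph of nodes
# required to produce one target dataset, by walking backwards from it.
# Node and dataset names are invented for the example.
nodes = {
    "clean": {"inputs": ["raw"], "outputs": ["clean_table"]},
    "join": {"inputs": ["clean_table", "reference"], "outputs": ["model_input_table"]},
    "train": {"inputs": ["model_input_table"], "outputs": ["model"]},
}

def nodes_to_materialize(nodes, target):
    """Return the sorted set of nodes needed to produce `target`."""
    producer = {ds: n for n, s in nodes.items() for ds in s["outputs"]}
    needed, stack = set(), [target]
    while stack:
        ds = stack.pop()
        if ds in producer and producer[ds] not in needed:
            node = producer[ds]
            needed.add(node)
            stack.extend(nodes[node]["inputs"])  # recurse into upstream datasets
    return sorted(needed)

print(nodes_to_materialize(nodes, "model_input_table"))  # ['clean', 'join']
```

note that "train" is correctly excluded when the target is model_input_table, which is exactly the asset-style selection being discussed.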
about Flyte - I have a working prototype of kedro-flyte - I hope we will be able to open source it in Nov/Dec this year.
Thanks, it would be nice to be able to try it out. May I ask what would make you choose flyte over, say, prefect for your kedro pipeline orchestration?
Main reason is in the execution layer: Prefect is mainly an orchestrator. Flyte is an orchestrator and an execution layer - it natively supports distributed execution. Prefect does not.
I see, if I understood properly, prefect seems to have a task runner concept that can make use of external tools like dask or ray, while flyte is kubernetes-native, so you get distributed execution natively (while still having access to dask, ray and spark through OSS kubernetes operators).
One painful point for me is Kubernetes, I have only a basic understanding of how it works and not much time to focus on learning it. Even though I would get some (limited) support from a DevOps in my company that has Kubernetes expertise, I am wondering if I should choose an orchestrator that relies on a technology that I don't understand. If any of you have some advice or experience being in this situation I would really appreciate it. 🙂
The question to answer is: what is your use case for the orchestrator? Is it (1) for scheduling Kedro pipelines, or (2) for scaling up Kedro pipelines?
For 1, Prefect is a good option; for 2, IMHO no.
FYI, after some careful consideration, we are starting to move forward with dagster.
I have made a small repo with the translation logic from kedro to dagster: kedro-spaceflights-dagster (see here for the logic). There's still room for improvement, and I plan to keep this repo updated as I go deeper into the capabilities of dagster. If there is any interest on your side, I'd be happy to open a PR to add this to the kedro deployment options.
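for anyone curious about the shape of the translation before reading the repo: the recurring pattern is a factory that turns each Kedro node into one orchestrator-side callable (in the real code these would then be wrapped with dagster's op/asset decorators; here is the bare pattern with made-up node functions, no dagster import needed):

```python
# Illustrative sketch of the translation pattern: one callable per Kedro
# node, built by a factory so each closure captures its own name and
# function (a loop over nodes without a factory would capture only the
# last loop variable).
def make_task(name, func):
    def task(*args):
        print(f"running {name}")  # where an orchestrator would log/track the run
        return func(*args)
    task.__name__ = name
    return task

# hypothetical stand-ins for node functions
kedro_nodes = {"double": lambda x: 2 * x, "inc": lambda x: x + 1}
tasks = {name: make_task(name, f) for name, f in kedro_nodes.items()}

print(tasks["double"](21))  # prints "running double", then 42
```

the interesting design choice (mentioned below) is which dagster construct to wrap these in, since dagster offers more than one way to define a generic task.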
thanks a lot for sharing ! how was the experience of building the code that translates Kedro into Dagster assets?
It was straightforward, but I had to make some choices, as dagster has different options to define a generic task. I replied with more details on the spike issue.
I will keep looking into better integrating kedro with dagster. While I have a translation script, it is not sufficient to actually use dagster in production (e.g. I can't assign execution targets to nodes or pipelines atm). Also, I believe there are other concepts in dagster that could be mapped to kedro or handled with kedro's omegaconf setup.
Very exciting! I haven't actually used Dagster myself, but this was something I was hoping to explore, so glad you beat me to it. I'll definitely take a look later. :)