Evaluating Kedro for Data Engineering Processes

Hello Kedro experts. We’re trying to evaluate how Kedro might fit into our data engineering processes as we deploy ML models for our customers. The nature of our work is such that we expect to deploy similar solutions across different customers who will have different environments. As such, there are certain Python scripts/packages that we expect to port across different environments, as well as aspects of every deployment that we expect to be custom. That probably means we want “nodes” in our data engineering pipelines that run with different sets of package requirements, since some of the ported code may have conflicting requirements. However, I believe a Kedro pipeline typically requires the same requirements.txt to be used throughout. Is that right?

So dependencies are simplest to manage at the repo / project level, but some people treat a pipeline like a package with its own dependencies and build tooling around that concept

you can also maintain "pipeline specific dependencies" with this pattern
https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#providing-pipeline-specific-dependencies

it's not super fleshed out, but it's in there
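
for reference, the pattern on that page essentially amounts to keeping a requirements.txt next to each modular pipeline's code, roughly like this (project and pipeline names are just placeholders):

```
src/my_project/pipelines/data_engineering/
├── __init__.py
├── nodes.py
├── pipeline.py
└── requirements.txt   <- dependencies specific to this pipeline
```

as far as I know these per-pipeline files aren't installed automatically at run time, so you'd still need to wire up your own packaging/install step around them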

it's unclear from the deprecation notice if this part is deprecated or just the sharing part, do you know?

hi, sadly micropackaging is deprecated. I've been collecting some thoughts on how we could break the 1 Kedro pipeline = 1 set of dependencies assumption: https://github.com/kedro-org/kedro/discussions/4147. It's still early stages, but any ideas are welcome

for now, your best bet is to have several Kedro projects

Ah, thanks for the information regardless. I’m honestly not sure what I would recommend for Kedro's design to solve this problem.

Internally, our conversations are turning to making use of containers for whatever elements we want to be portable. Hypothetically, we would then orchestrate those containers with an orchestrator that’s set up for that (Argo, for example, or perhaps something Kedro has documented deployment support for, like Prefect).

If we wanted to, the customizations we need could still happen in Kedro; it would just all have to be architected such that the Kedro pipeline would itself be a node within whatever orchestrator we choose? Or is there a way to “link up” the DAG created from a Kedro pipeline with steps in a DAG defined outside Kedro, as long as it's the same overarching orchestrator tool?

it would just all have to be architected such that the Kedro pipeline would itself be a node within whatever orchestrator we choose?

I would say so, yes. Different people have different opinions on the optimal level of granularity, but we're observing that individual nodes are usually too small for a container, and that 1 pipeline = 1 container (or at least 1 coherent group of nodes = 1 container) is more adequate
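
to make that concrete, here's a rough sketch of what 1 pipeline = 1 container could look like with an external orchestrator driving things. I'm using Prefect purely as an example; the image names, pipeline names and the customer_a env are all made up, and in practice you'd probably use your orchestrator's native container/Kubernetes task rather than shelling out to docker:

```python
# Hypothetical sketch: each Kedro pipeline is baked into its own image with its
# own requirements.txt, and the orchestrator (Prefect here) wires them together.
import subprocess

from prefect import flow, task


@task(retries=1)
def run_kedro_pipeline(image: str, pipeline: str, env: str) -> None:
    # One Kedro pipeline == one container, so conflicting dependencies never
    # have to live in the same environment.
    subprocess.run(
        [
            "docker", "run", "--rm", image,
            "kedro", "run", f"--pipeline={pipeline}", f"--env={env}",
        ],
        check=True,
    )


@flow
def customer_deployment(env: str = "customer_a") -> None:
    # The DAG between containers lives in the orchestrator; the DAG inside each
    # container is still an ordinary Kedro pipeline.
    run_kedro_pipeline("my-registry/shared-data-engineering:latest", "data_engineering", env)
    run_kedro_pipeline("my-registry/customer-custom-model:latest", "model_training", env)


if __name__ == "__main__":
    customer_deployment()
```

the nice side effect is that the ported, shared code and the per-customer custom code each keep their own dependency set, which is exactly the conflict you were worried about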

Gotcha, thanks for the info. I always appreciate the info and responsiveness on this channel 🙂.
