Dependency Issue Between Two Nodes in Kedro

Hello everyone,

I’m facing an issue regarding dependency management between two Nodes in Kedro, and I’d appreciate your insights 🙂

I have a Node A that is supposed to be a dependency for Node B. However, Node A does not return any data as output, which prevents me from creating an explicit link between these two Nodes in the pipeline. As a result, Node B can execute before Node A, which is not the desired behavior.

My question is: how can I force Kedro to treat Node B as dependent on Node A, even if Node A doesn’t produce any output data? Is there a clean way to define an explicit dependency between two Nodes in this case?

Thanks in advance for your help! 😊

Hi @Mohamed El Guendouz, you can use a simple trick, such as returning a text string as the output of Node A and using it as an input for Node B.
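For example, a rough sketch of that trick (function, node, and dataset names are just placeholders):

from kedro.pipeline import node, pipeline

def run_node_a() -> str:
    # Node A does its real work here (side effects only).
    return "done"  # dummy marker output, used only to order the nodes

def run_node_b(node_a_marker: str) -> None:
    # The node_a_marker input exists purely to make Node B run after Node A.
    ...

example_pipeline = pipeline(
    [
        node(run_node_a, inputs=None, outputs="node_a_marker", name="node_a"),
        node(run_node_b, inputs="node_a_marker", outputs=None, name="node_b"),
    ]
)

Because Node B consumes the dataset Node A produces, Kedro's resolver will always schedule Node A first.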

Hi @Dmitry Sorokin 🙂

Thank you for your suggestion! Using a text string as the output of Node A and passing it as an input to Node B is indeed a valid approach to enforce dependencies in Kedro. However, my use case has some specific nuances that make this challenging:

  1. Node Execution in Isolation:
  • In my workflow, I often execute individual Nodes independently using the following commands:
userflow --conf-source=conf-userflow.tar.gz --pipeline=<pipeline_name> --from-nodes=<node_A> --to-nodes=<node_A> --async
userflow --conf-source=conf-userflow.tar.gz --pipeline=<pipeline_name> --from-nodes=<node_B> --to-nodes=<node_B> --async
  • If Node A produces a text string as output, Kedro will fail when running Node B directly, as it would require Node A to have executed beforehand to generate the dataset.
  2. Airflow DAG Generation:
  • I am generating an Airflow DAG from the Kedro pipeline. I need to ensure that Node B depends on Node A, but this dependency should be inferred directly from Kedro rather than relying on the presence of dummy output datasets (e.g., MemoryDataSet). Such outputs can complicate DAG generation and cause failures in isolated Node executions.

That said, if you have any other ideas or approaches that could help address this challenge, I’d love to hear them! 😊

Thanks for the clarification, @Mohamed El Guendouz. I believe the only option here is to save the text input/output in the Kedro DataCatalog as a TextDataset. This will allow you to execute your pipeline starting from node B as well - just make sure the file is in place. As you can see, this is more of a workaround, since Kedro assumes it can resolve the node order automatically from inputs and outputs. As far as I know, this is the only way to establish dependencies between nodes.
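For reference, the catalog entry could look roughly like this in conf/base/catalog.yml (the dataset name and filepath are placeholders):

node_a_marker:
  type: text.TextDataset
  filepath: data/08_reporting/node_a_marker.txt

Because the marker is persisted to disk rather than kept in memory, running node B in isolation works as long as the file from a previous run of node A is in place.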

Thank you so much @Dmitry Sorokin, it worked! 🙌 Your suggestion to use the Kedro DataCatalog to save text inputs/outputs as a TextDataset was perfect. I was able to execute the pipeline starting from node B by making sure the file was in place. I understand that it's more of a workaround, but it works great for establishing dependencies between nodes. Thanks again for your help, it's awesome! 🎉

However, I’m wondering what would happen if multiple nodes tried to write to the same TextDataset at the same time. Do you know if Kedro handles this scenario well, or would we need to implement a workaround for each pipeline? This could end up cluttering the DataCatalog a bit.

I'm happy that it helps @Mohamed El Guendouz. In Kedro, a pipeline is only valid if each output is produced by exactly one node, so the situation you described is impossible. If you want to extend that workaround to many nodes, I can recommend using dataset factories to minimize the effect on the catalog:
https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html
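For example, a single factory pattern in catalog.yml could cover one marker dataset per node (the pattern and path are just an illustration):

"{node_name}_marker":
  type: text.TextDataset
  filepath: data/08_reporting/{node_name}_marker.txt

Any output whose name matches the pattern, such as node_a_marker, then resolves to its own text file without needing a separate catalog entry.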
