Subject: Dependency Issue Between Two Nodes in Kedro
Hello everyone,
I’m facing an issue regarding dependency management between two Nodes in Kedro, and I’d appreciate your insights 🙂
I have a Node A that is supposed to be a dependency for Node B. However, Node A does not return any data as output, which prevents me from creating an explicit link between these two Nodes in the pipeline. As a result, Node B can execute before Node A, which is not the desired behavior.
My question is: how can I force Kedro to treat Node B as dependent on Node A, even if Node A doesn’t produce any output data? Is there a clean way to define an explicit dependency between two Nodes in this case?
Thanks in advance for your help! 😊
Hi @Mohamed El Guendouz, you can use a simple trick, such as returning a text string as the output of Node A and using it as an input for Node B.
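The trick can be sketched like this (function and dataset names are hypothetical; the Kedro wiring assumes Kedro >= 0.18 and is shown in comments):

```python
# A minimal sketch of the trick: Node A returns a dummy marker string, and
# Node B declares that marker as an input, which forces Kedro to schedule
# A before B. Function and dataset names here are illustrative.

def run_node_a():
    # ... Node A's real side-effecting work would go here ...
    return "node_a_done"  # dummy output used only to create the dependency

def run_node_b(node_a_marker):
    # The marker's value is ignored; listing it as an input is what matters.
    # ... Node B's real work would go here ...
    return None

# Pipeline wiring (requires kedro to be installed):
# from kedro.pipeline import node, pipeline
#
# my_pipeline = pipeline([
#     node(run_node_a, inputs=None, outputs="node_a_marker", name="node_A"),
#     node(run_node_b, inputs="node_a_marker", outputs=None, name="node_B"),
# ])
```

Because the marker is a regular dataset, Kedro's DAG resolution now sees an explicit edge from Node A to Node B.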
Hi @Dmitry Sorokin 🙂
Thank you for your suggestion! Using a text string as the output of Node A and passing it as an input to Node B is indeed a valid approach to enforce dependencies in Kedro. However, my use case has some specific nuances that make this challenging:
- The Nodes are executed in isolation, each via its own CLI call:

```
userflow --conf-source=conf-userflow.tar.gz --pipeline=<pipeline_name> --from-nodes=<node_A> --to-nodes=<node_A> --async
userflow --conf-source=conf-userflow.tar.gz --pipeline=<pipeline_name> --from-nodes=<node_B> --to-nodes=<node_B> --async
```

- The outputs are kept in memory (MemoryDataSet). These outputs can complicate DAG generation and cause failures in isolated Node executions.

Thanks for the clarification, @Mohamed El Guendouz. I believe the only option here is to save the text input/output in the Kedro DataCatalog as a TextDataset. This will allow you to execute your pipeline starting from node B as well - just ensure the file is in place. As you can see, this is more of a workaround, since Kedro assumes it can automatically resolve the node order based on inputs and outputs. As far as I know, this is the only way to establish a dependency between nodes.
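As a sketch, persisting the marker could look like this in the catalog (the dataset name and filepath are hypothetical; assumes the `kedro-datasets` package is installed):

```yaml
# conf/base/catalog.yml (hypothetical entry)
node_a_marker:
  type: text.TextDataset
  filepath: data/02_intermediate/node_a_marker.txt
```

With the marker written to disk, a run starting from node B can load it even though node A did not run in the same invocation.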
Thank you so much @Dmitry Sorokin, it worked! 🙌 Your suggestion to use the Kedro DataCatalog to save text inputs/outputs as a TextDataset was perfect. I was able to execute the pipeline starting from node B by making sure the file was in place. I understand that it's more of a workaround, but it works great for establishing dependencies between nodes. Thanks again for your help, it's awesome! 🎉
However, I’m wondering what would happen if multiple nodes tried to write to the same TextDataset at the same time. Do you know if Kedro handles this scenario well, or would we need to implement a workaround for each pipeline? This could end up cluttering the DataCatalog a bit.
I'm happy that it helped, @Mohamed El Guendouz. In Kedro, a pipeline is only valid if each output is produced by exactly one node, so the situation you describe cannot occur. If you want to extend this workaround to many nodes, I recommend using dataset factories to minimize the impact on the catalog:
https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html
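A factory pattern along these lines (the placeholder name and path are illustrative; dataset factories require Kedro >= 0.18.12) could cover every marker dataset with a single catalog entry:

```yaml
# conf/base/catalog.yml (illustrative factory pattern)
"{node_name}_marker":
  type: text.TextDataset
  filepath: data/02_intermediate/{node_name}_marker.txt
```

Any node output whose name matches the pattern, e.g. `node_a_marker`, would then resolve to this entry automatically instead of needing its own catalog definition.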