Kedro node connection without dummy data

Hi everyone! I have a couple of questions about Kedro:

  1. I'm using an external Java tool to convert XML to linked data in one of my nodes. The tool produces an output, but the file is created outside of the Python function. Right now I return a dummy dataset from that node and feed it into the next node so that Kedro-Viz visualizes the connection properly (see the sketch after this list). However, this feels a bit clumsy. Is there a more elegant way to sequentially connect nodes in Kedro without requiring a dataset in between?
  2. I would like to use Kedro for a project that performs the ETL for multiple institutes. I'm planning to use namespaces, since the ETL process is similar for most institutes. After the individual pipelines have run, part of the ETL can be run either with the output from a single institute or, sometimes, with the outputs from all institutes together. Currently, with a pure Python approach, we write each institute's data into a shared directory and then run the shared part on the contents of that directory. However, Kedro doesn't allow multiple nodes to write to the same dataset (a folder, in this case). How could I connect the shared pipeline with each institute's pipeline?
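
For context on question 1, here is a minimal sketch of the dummy-dataset workaround I described; the jar and dataset names are invented for illustration:

import subprocess

from kedro.pipeline import node, pipeline

def convert_xml_to_rdf(xml_path: str) -> bool:
    # Hypothetical call to the external Java tool; it writes its RDF
    # output to disk itself, outside of the Kedro catalog.
    subprocess.run(["java", "-jar", "xml2rdf.jar", xml_path], check=True)
    # Dummy return value whose only purpose is to let the next node
    # declare a dependency on this one.
    return True

def next_step(conversion_done: bool) -> None:
    # Work that must run after the conversion; the dummy input exists
    # purely to order the nodes in the DAG and in Kedro-Viz.
    ...

etl = pipeline([
    node(convert_xml_to_rdf, inputs="params:xml_path", outputs="conversion_done"),
    node(next_step, inputs="conversion_done", outputs=None),
])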
Thanks in advance for your help!

5 comments

  1. For the first question, I'm not sure it's possible, but perhaps other community members can help. Kedro-Viz builds its DAG from datasets: every function takes inputs and returns outputs, so it might be tricky to achieve without a dummy dataset in between.
  2. For the second question, you could consider using a PartitionedDataset, which writes a folder containing multiple files. That could handle the outputs from multiple pipelines (see the sketch below).
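
A minimal sketch of that suggestion, assuming a pandas-based catalog entry and invented dataset names:

from typing import Callable

import pandas as pd

# Assumed catalog.yml entry:
#
#   institutes_rdf:
#     type: partitions.PartitionedDataset
#     path: data/02_intermediate/institutes
#     dataset: pandas.CSVDataset
#     filename_suffix: ".csv"

def save_institute_outputs(a: pd.DataFrame, b: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # A single node can write many files: each dict key becomes one
    # partition (file) inside the configured folder.
    return {"institute_a": a, "institute_b": b}

def load_all_institutes(partitions: dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    # On load, a PartitionedDataset provides one lazy load callable
    # per file found in the folder.
    return pd.concat(load() for load in partitions.values())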

Thanks a lot! The proposed solution for the second question sounds promising, especially if it lets me run the shared pipeline independently, without having to execute all the institutes' pipelines beforehand. As for point 1, I really hope we can find a solution soon; it's quite cumbersome having to work with these dummy datasets...

Unfortunately, it did not work:

kedro.pipeline.pipeline.OutputNotUniqueError: Output(s) ['dummy_rdf_dir_dataset'] are returned by more than one nodes. Node outputs must be unique.
The error occurs even when the output is declared as a PartitionedDataset.

Multiple nodes cannot share the same output. I recommend giving each node its own output, then writing a new node that aggregates them: its inputs are the separate outputs, and its output can be the partitioned dataset.
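
A sketch of that aggregation pattern, with hypothetical namespaced dataset names:

from kedro.pipeline import node, pipeline

def aggregate_outputs(rdf_a, rdf_b):
    # Each dict key becomes one file inside the folder backing the
    # partitioned dataset, so nothing gets overwritten.
    return {"institute_a": rdf_a, "institute_b": rdf_b}

aggregation = pipeline([
    node(
        aggregate_outputs,
        inputs=["institute_a.rdf_output", "institute_b.rdf_output"],
        outputs="shared_rdf_dir",  # declared as partitions.PartitionedDataset
    ),
])

Since only this one node writes to shared_rdf_dir, the OutputNotUniqueError goes away, and the shared pipeline can read the folder back as a single input.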

Thank you, yes, but then I need to create several dummy datasets and things might get out of control. Does anyone have other ideas on how to connect nodes without a dataset output?
