
Hello everyone,

I am working on a dynamic pipeline that generates a file for each year in a list, such that the catalog entry would be

data_{year}:
  type: pandas.ExcelDataset
  filepath: reports/folder/data_{year}.xlsx
  save_args:
    index: False
Then I have another pipeline that aggregates all the files to process them, loading them as a PartitionedDataset with this entry:

partitioned_data:
  type: partitions.PartitionedDataset
  path: reports/folder
  dataset:
    type: pandas.ExcelDataset
The main problem with my approach is that even though these two entries refer to the same data, they are in fact different entries, so Kedro runs the second pipeline before the dynamic one.
I would appreciate your input on this issue,

Thanks a lot!


Hi, thanks for the question.

The main problem with my approach is that even though these two entries refer to the same data, they are in fact different entries, so Kedro runs the second pipeline before the dynamic one.
Is it possible to use a PartitionedDataset instead of a dynamic pipeline in this case?

I understand the reason this happens is that the two pipelines are disconnected: if you visualise them with Kedro-Viz, you will see a disconnected graph, so Kedro doesn't know that the first one needs to be executed before the other. The other option is to create a dummy input/output pair, to ensure the dependency is resolved correctly.
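A minimal sketch of the dummy input/output idea, using plain node functions (the names `write_report`, `aggregate`, and the `done_{year}` flag values are illustrative, not Kedro API; in practice each flag would be declared as an output of a yearly node in the catalog/pipeline definition, and as an input of the aggregating node):

```python
# Hypothetical node functions showing the dummy-dependency pattern.

def write_report(data, year):
    """Node in the dynamic pipeline: writes one year's file via the
    catalog (data_{year}). The return value is a dummy output whose
    only purpose is to appear in the second pipeline's inputs."""
    # ... real per-year processing and saving happens via the catalog ...
    return f"done_{year}"  # dummy flag, e.g. catalog entry "report_flag_{year}"

def aggregate(partitions, *flags):
    """Node in the second pipeline: it declares the dummy flags as
    inputs, so Kedro schedules it after every write_report node.
    `partitions` is the {partition_id: load_callable} dict that a
    PartitionedDataset passes to a node."""
    return {name: load() for name, load in partitions.items()}
```

Because `aggregate` now depends on outputs of the yearly nodes, the resolver can no longer run the second pipeline first, even though the real data flows through the filesystem rather than through the DAG.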

Thanks a lot for the quick answer!
I am a bit concerned that loading the data as partitions instead of looping through the files will cause memory issues. Could you elaborate a bit on your suggestion?

Which suggestion are you referring to?

My concern is that by using a partitioned dataset instead of a dynamic pipeline I will run into memory issues, since the data files are fairly heavy, so I wanted to know your take on this.

https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html
For a partitioned dataset, you can use lazy loading/saving to help with the memory issue.
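As a sketch of how lazy loading keeps memory bounded (the function names here are illustrative): a PartitionedDataset hands the node a dict mapping partition ids to load callables, so nothing is read from disk until a callable is invoked, and you can process one file at a time:

```python
def summarize(df):
    # Placeholder for a per-file computation that yields a small result.
    return len(df)

def process_partitions(partitions):
    """Process one partition at a time so only a single file's
    data is in memory at any point. `partitions` is the
    {partition_id: load_callable} dict a PartitionedDataset provides."""
    results = {}
    for partition_id, load in sorted(partitions.items()):
        df = load()                    # lazy load: this file is read only now
        results[partition_id] = summarize(df)  # keep only the small aggregate
        del df                         # drop the heavy object before the next file
    return results
```

Lazy saving works the same way in reverse: a node can return a dict of `{partition_id: callable}` and the dataset will invoke each callable only when that partition is written.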

If you prefer the dynamic-pipeline approach, that's totally fine, but as mentioned you would need a dummy input/output to control the execution order.

Side note: https://github.com/kedro-org/kedro/discussions/3758

There has been some discussion about adding a custom execution order; feel free to comment there if this is of interest to you.
