Hello everyone,

Hugo Acosta

Hello everyone,

I am working on a dynamic pipeline that generates one file for each year in a list, so the catalog entry is:

data_{year}:
  type: pandas.ExcelDataset
  filepath: reports/folder/data_{year}.xlsx
  save_args:
    index: False

Then, I have another pipeline that aggregates all the files for processing by loading them as a PartitionedDataset, with the entry:

partitioned_data:
  type: partitions.PartitionedDataset
  path: reports/folder
  dataset:
    type: pandas.ExcelDataset

The main problem with my approach is that even though these two entries refer to the same data, they are in fact different entries, so Kedro runs the second pipeline before the dynamic one.
I would appreciate your input on this issue.

Thanks a lot!

6 comments

Nok Lam Chan

Hi, thanks for the question.

The main problem with my approach is that even though these two entries refer to the same data, they are in fact different entries, so Kedro runs the second pipeline before the dynamic one.

Is it possible to use a PartitionedDataset instead of a dynamic pipeline in this case?

I understand why this happens: if you visualise this pipeline with Kedro-Viz, it will appear as a disconnected one, so Kedro doesn't know that the first pipeline needs to be executed before the other. The other option is to create a dummy input/output pair to ensure the dependency is resolved correctly.
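
A minimal sketch of the dummy input/output idea, under stated assumptions: the dataset names `source_data`, `done_{year}`, `reports_ready` and `aggregated`, the function names, and the list of years are all hypothetical and not from the original post. The flag datasets are deliberately not named `data_..._done`, so they cannot be confused with the `data_{year}` catalog entry. The node functions use plain Python lists to stay dependency-free; in a real project `build_report` would return a pandas DataFrame for the ExcelDataset.

```python
from functools import partial

YEARS = [2021, 2022, 2023]  # hypothetical list of years


def build_report(source, year):
    """Build the report for one year and return it together with a dummy flag."""
    report = [row for row in source if row["year"] == year]
    return report, True  # second value only signals "this file was written"


def all_done(*flags):
    """Collapse the per-year flags into a single dummy dataset."""
    return all(flags)


def aggregate(partitions, ready):
    """Count rows across all partitions, loading one partition at a time.

    `ready` is unused on purpose: it exists only so that Kedro schedules
    this node after every yearly node has finished.
    """
    return sum(len(load()) for load in partitions.values())


def create_pipeline(years=YEARS):
    # Imported here so the functions above can be tried without Kedro installed.
    from kedro.pipeline import node, pipeline

    yearly = [
        node(
            partial(build_report, year=y),
            inputs="source_data",
            outputs=[f"data_{y}", f"done_{y}"],
            name=f"build_report_{y}",
        )
        for y in years
    ]
    gate = node(
        all_done,
        inputs=[f"done_{y}" for y in years],
        outputs="reports_ready",
        name="reports_ready_gate",
    )
    agg = node(
        aggregate,
        inputs=["partitioned_data", "reports_ready"],
        outputs="aggregated",
        name="aggregate_reports",
    )
    return pipeline(yearly + [gate, agg])
```

The flags do not need catalog entries, since Kedro keeps undeclared datasets in memory. With this wiring the aggregation node shows up downstream of the yearly nodes in Kedro-Viz, so it can no longer run first.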

Hugo Acosta

Thanks a lot for the quick answer!
I am a bit concerned that loading the files as a partitioned dataset instead of looping through them will cause memory issues. Could you elaborate a bit on your suggestion?

Nok Lam Chan

Which suggestion are you referring to?

Hugo Acosta

My concern is that by using a partition dataset instead of a dynamic pipeline I will encounter memory issues, since the data files are kinda heavy, so I wanted to know your take on this.

Nok Lam Chan

https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html
For a PartitionedDataset, you could use lazy loading/lazy saving to help with the memory issue.
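
As a rough sketch of what lazy loading and lazy saving look like in a node function: a PartitionedDataset hands the node a dictionary that maps each partition id to a load function, and a node can return a dictionary of callables so that each partition is only computed when it is saved. The function names and the `transform` argument below are hypothetical, and the exact partition ids depend on the catalog entry (for example on `filename_suffix`).

```python
def summarise_partitions(partitions):
    """Lazy loading: call each load function in turn, so only one
    partition is held in memory at a time."""
    summary = {}
    for partition_id, load in sorted(partitions.items()):
        data = load()  # the file is read here, not before
        summary[partition_id] = len(data)
        del data  # drop the reference before loading the next partition
    return summary


def transform_lazily(partitions, transform):
    """Lazy saving: return callables instead of data, so each partition
    is loaded and transformed only when its callable is invoked."""

    def make_saver(load):
        # A helper function avoids the late-binding closure pitfall in loops.
        return lambda: transform(load())

    return {pid: make_saver(load) for pid, load in partitions.items()}
```

Neither function loads anything up front, so peak memory stays close to the size of the largest single file rather than the sum of all of them.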

If you prefer the dynamic pipeline way, it's totally fine, but as mentioned you would need a dummy input/output to control the execution order.

Nok Lam Chan

Side note: https://github.com/kedro-org/kedro/discussions/3758

There has been some discussion about adding a custom execution order. Feel free to comment if this is of interest to you.
