Implementing a global partition filter for partitioned datasets

Hi all! I have a pipeline where every dataset is a PartitionedDataset, and I want to implement a global filter that selects which partitions to process. I hoped I could drop some partitions in an after_dataset_loaded hook, but that's not supported. Is there some other way to implement a global partition filter?
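
(For context, a minimal sketch of the hook approach described above, with illustrative names. As far as I know, Kedro's after_dataset_loaded hook is observational: it receives the loaded data, but its return value is discarded, so filtering there never reaches the node.)

```python
from typing import Any

from kedro.framework.hooks import hook_impl


class PartitionFilterHooks:
    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data: Any):
        # A PartitionedDataset loads as {partition_id: load_callable}.
        if isinstance(data, dict):
            # Dead end: this hook is for observation only, and its return
            # value is discarded, so the node still receives the full dict.
            return {k: v for k, v in data.items() if k.startswith("2024-01-15")}
```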

Hi Paul, can you describe a bit more what you're trying to do? The PartitionedDataset is designed so that the node consuming it implements the logic that decides which partitions need to be loaded and how that data is processed.
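
(A minimal sketch of that node-side pattern, assuming hourly partitions keyed by strings like "2024-01-15-13"; the function and parameter names are hypothetical.)

```python
import pandas as pd


def process_latest_hours(partitions: dict, cutoff: str) -> pd.DataFrame:
    """Load and process only partitions whose id sorts at or after `cutoff`.

    `partitions` is what a PartitionedDataset loads: a dict mapping
    partition ids (e.g. "2024-01-15-13") to lazy load callables.
    """
    selected = {pid: load for pid, load in partitions.items() if pid >= cutoff}
    if not selected:
        return pd.DataFrame()
    # Only the selected partitions are ever materialised.
    return pd.concat([load() for load in selected.values()], ignore_index=True)
```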

Hi Merel, I already have filter logic in the nodes that selects the partitions to process. Now I've realised the logic is getting more complicated and needs more information from command-line parameters. I hoped I could use a hook instead of passing all the filter parameters to every node. The partition keys I want to process are the same in every node, so I hoped for a solution where I implement the filter logic just once.
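
(A hedged sketch of one way to centralise this. It assumes Kedro's before_node_run hook, whose return value can replace node inputs, plus after_context_created to pick up runtime parameters; the partition-detection heuristic and the partition_prefix parameter are my own inventions, not anything from this thread.)

```python
from typing import Optional

from kedro.framework.hooks import hook_impl


class GlobalPartitionFilterHooks:
    """Apply one partition-key filter to every node's partition inputs."""

    def __init__(self) -> None:
        self._keep_prefix: Optional[str] = None

    @hook_impl
    def after_context_created(self, context) -> None:
        # Picks up e.g.: kedro run --params=partition_prefix=2024-01-15
        self._keep_prefix = context.params.get("partition_prefix")

    @hook_impl
    def before_node_run(self, inputs: dict) -> Optional[dict]:
        if self._keep_prefix is None:
            return None
        # before_node_run may return {input_name: new_value} to replace inputs.
        # Crude, hypothetical heuristic: any non-parameter dict input with
        # string keys is treated as a loaded PartitionedDataset.
        replaced = {
            name: {k: v for k, v in value.items() if k.startswith(self._keep_prefix)}
            for name, value in inputs.items()
            if not name.startswith("param")
            and isinstance(value, dict)
            and all(isinstance(k, str) for k in value)
        }
        return replaced or None
```

Registering the class once, e.g. HOOKS = (GlobalPartitionFilterHooks(),) in settings.py, would apply the same filter to every node without touching their signatures.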

Ah, I see. It's an interesting use case. Most out-of-the-box features assume you want to filter by nodes, not datasets. I don't have an answer right now for how best to solve this.

Maybe you can use a parameter and pass it to all nodes that need the partition info?
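
(A sketch of that suggestion; partition_filter, the node, and the dataset names are made up. The parameter would live in conf/base/parameters.yml and could be overridden at run time with kedro run --params.)

```python
# conf/base/parameters.yml would contain something like:
#   partition_filter:
#     date: "2024-01-15"
#     hour: 13

from kedro.pipeline import Pipeline, node


def process_hour(partitions: dict, partition_filter: dict) -> dict:
    prefix = f"{partition_filter['date']}-{partition_filter['hour']:02d}"
    return {pid: load() for pid, load in partitions.items() if pid.startswith(prefix)}


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(
                process_hour,
                inputs=["hourly_events", "params:partition_filter"],
                outputs="processed_events",
            )
        ]
    )
```

The filter could then be overridden once for the whole run, e.g. kedro run --params='partition_filter.date=2024-01-16' (dot syntax for nested keys, if I remember the CLI correctly).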

Thanks. I'll probably gather all the filter information in one dict and pass that to all nodes.

I wonder if my use cases are really so different from what Kedro was designed for. Having all datasets partitioned by day, hour, client, or warehouse is pretty common in my work. A typical pipeline runs every hour, processing the last hour's data. A second thing I'm missing in this context is computing output partitions only if they don't already exist.
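
(On the second point, a sketch of one incremental pattern, under two assumptions: that PartitionedDataset accepts a dict of callables for lazy saving, and that the already-written output partitions can be listed via a second catalog entry pointing at the same path, since a node can't have the same dataset as both input and output. transform is hypothetical.)

```python
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-partition computation.
    return df


def compute_missing(inputs: dict, existing_outputs: dict) -> dict:
    """Return lazy outputs only for partitions not materialised yet.

    `existing_outputs` comes from a second catalog entry that points at the
    same path as the output PartitionedDataset, used purely to list keys.
    """
    missing = {pid: load for pid, load in inputs.items() if pid not in existing_outputs}
    # Returning callables keeps the save lazy: each partition is computed
    # only when the PartitionedDataset actually writes it.
    return {pid: (lambda load=load: transform(load())) for pid, load in missing.items()}
```

Kedro's IncrementalDataset, which tracks a checkpoint of already-processed partitions, may also be worth a look for the hourly case.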
