Implementing a global partition filter for partitioned datasets

At a glance

The community member has a pipeline where all datasets are PartitionedDataset and wants to implement a global filter on which partitions to process. They hoped to drop some partitions in an after_dataset_loaded hook, but that's not supported. Another community member suggests that the node consuming the PartitionedDataset should implement the logic to define what partitions need to be loaded and how the data will be processed.

The original community member explains that they already have filter logic in the nodes, but the logic is becoming more complicated and requires more information from command line parameters. They hoped to use a hook instead of passing all the filter parameters to every node, as the partition keys they want to process are the same in every node.

The community members discuss that this is an interesting use case, as most out-of-the-box features assume filtering by nodes, not datasets. One community member suggests using a parameter and passing it to all nodes that need the partition information. The original community member says they will probably gather all the information in one dictionary and pass it to all nodes.

The original community member wonders if their use cases are different from what Kedro has been designed for, as having all datasets partitioned by day, hour, client, or warehouse is pretty common in their cases. They also mention missing the ability to compute output partitions only if they don't exist yet.

Hi all! I have a pipeline where all datasets are PartitionedDataset. I want to implement a global filter on which partitions to process. I hoped I could drop some partitions in an after_dataset_loaded hook, but that's not supported. Is there some other way to get a global partition filter?

6 comments

Hi Paul, can you maybe describe a bit more what you're trying to do? PartitionedDataset is designed so that the node consuming it implements the logic that defines which partitions need to be loaded and how the data is processed.
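A minimal sketch of that pattern: a PartitionedDataset loads as a dictionary mapping partition ids to lazy load functions, so a node can select keys before reading any data. The pandas types and the date-prefix condition below are hypothetical placeholders.

```
from typing import Callable

import pandas as pd


def process_partitions(
    partitions: dict[str, Callable[[], pd.DataFrame]],
) -> pd.DataFrame:
    # PartitionedDataset loads as {partition_id: lazy_load_fn}, so keys
    # can be filtered before any data is actually read from storage.
    selected = {
        key: load
        for key, load in partitions.items()
        if key.startswith("2024-06-01")  # hypothetical filter condition
    }
    # Materialise only the selected partitions and combine them.
    return pd.concat([load() for load in selected.values()])
```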

Hi Merel, I already have filter logic in the nodes that selects the partitions to process. Now I've realized that the logic is getting more complicated and requires more information from command-line parameters. I hoped I could use a hook instead of passing all the filter parameters to every node. The partition keys I want to process are the same in every node, so I was looking for a solution where I implement the filter logic just once.

Ah, I see. It's an interesting use case. Most out-of-the-box features assume you'd like to filter by nodes, not datasets. I don't have an answer right now on how best to solve this.

Maybe you can use a parameter and pass it to all nodes that need the partition info?

Thanks. I will probably gather all the information in one dict and pass that to all nodes.
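A minimal sketch of that approach, assuming a partition_filter entry in parameters.yml (the key names here are made up): the filter logic lives in a single shared helper, and every node receives the same dict via params:partition_filter in the pipeline definition.

```
from typing import Any, Callable

import pandas as pd
from kedro.pipeline import node

# Hypothetical entry in conf/base/parameters.yml:
#   partition_filter:
#     date_prefix: "2024-06-01"


def select_partitions(
    partitions: dict[str, Callable[[], pd.DataFrame]],
    partition_filter: dict[str, Any],
) -> dict[str, Callable[[], pd.DataFrame]]:
    # The filter logic lives in exactly one place; every node calls this.
    prefix = partition_filter["date_prefix"]
    return {key: load for key, load in partitions.items() if key.startswith(prefix)}


def my_node(
    partitions: dict[str, Callable[[], pd.DataFrame]],
    partition_filter: dict[str, Any],
) -> pd.DataFrame:
    selected = select_partitions(partitions, partition_filter)
    return pd.concat([load() for load in selected.values()])


# In the pipeline definition, each node gets the same parameter:
filtered_node = node(
    my_node,
    inputs=["input_partitions", "params:partition_filter"],
    outputs="combined_output",
)
```

Because partition_filter is an ordinary Kedro parameter, it can also be overridden at run time via kedro run --params, which covers the command-line requirement mentioned above.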

I wonder if my use cases are so different from what Kedro has been designed for. Having all datasets partitioned by day, hour, client, or warehouse is pretty common in my cases. A typical pipeline runs every hour, processing the last hour's data. A second thing I'm missing in this context is computing output partitions only if they don't exist yet.
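Kedro doesn't offer skip-if-output-exists behaviour for partitions out of the box, but one possible workaround is to add a second catalog entry pointing at the output path, so the node can see which output partitions already exist. A sketch under that assumption (two catalog entries, same filepath; the identity transform stands in for the real computation):

```
from typing import Callable

import pandas as pd


def compute_missing_partitions(
    input_partitions: dict[str, Callable[[], pd.DataFrame]],
    existing_outputs: dict[str, Callable[[], pd.DataFrame]],
) -> dict[str, pd.DataFrame]:
    # existing_outputs is loaded from a catalog entry pointing at the
    # output path, so its keys show which partitions were already written.
    missing = set(input_partitions) - set(existing_outputs)
    return {
        # Returning a dict makes PartitionedDataset write one file per key.
        key: input_partitions[key]()  # identity transform as a placeholder
        for key in missing
    }
```

Note that the output dataset and the existing_outputs input need different catalog names, since a Kedro node can't list the same dataset as both input and output.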
