Best practice for rerunning clustering pipeline with different timestamps

Hi all!

I am working with a clustering pipeline that I regularly rerun to monitor cluster migrations, and I use SnowparkTableDataset to save data directly to the data warehouse. Since it is not possible to have the same dataset as both input and output in Kedro, I was wondering what the best practice would be to rerun the clustering and store the results to the same SnowparkTableDataset, e.g. under a different timestamp. Would appreciate your help here!

Hi, from your use case, I found PartitionedDataset and IncrementalDataset to be helpful. If you haven't tried them already, please check the docs here: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html
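
As a rough illustration (names and paths are made up, and note that PartitionedDataset works over a filesystem path, so each run would land as a separate file rather than going straight to Snowflake), a catalog entry could look like this, assuming a recent kedro-datasets:

```yaml
cluster_runs:
  type: partitions.PartitionedDataset
  path: data/08_reporting/cluster_runs
  dataset:
    type: pandas.ParquetDataset
  filename_suffix: ".parquet"
```

A node writing to this dataset returns a dict keyed by partition name, e.g. {"2024-05-01T12-00": clusters_df}, so each rerun can use its run timestamp as the key.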

Also, if your clustering pipeline runs on the entire dataset and you want to keep different versions, you can try dataset versioning in the catalog. Thank you!
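
For example, a minimal versioning sketch (as far as I know, versioning only works for file-based datasets, not for SnowparkTableDataset, so this would mean saving file snapshots):

```yaml
clusters:
  type: pandas.ParquetDataset
  filepath: data/08_reporting/clusters.parquet
  versioned: true
```

With versioned: true, every save goes to a new timestamped location under the filepath, so each pipeline run is kept side by side.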

Hi Ravi, thank you for your quick response! That looks promising indeed. Any chance you have already tested this with a SnowparkTableDataset? Regarding versioning, I thought that, with or without versioning, it is not possible to have the same dataset as both input and output. Are you saying that versioning lifts this constraint?

Oh yes, I think Kedro does not allow the same dataset to be both an input and an output. I haven't tried incremental datasets before. Also, I was wondering if I understood your question correctly:

  1. You have a pipeline with a node that takes dataset x -> dataset x? or
  2. You have a pipeline with a node that takes dataset x -> dataset x_with_timestamp, and the next iteration takes dataset x_with_timestamp as input?

I think both would work, but the latter would be a bit cleaner; see the sketch below. I am also wondering what the community thinks the best solution would be in this case :)
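
Here is a minimal sketch of the second option, assuming the OmegaConfigLoader runtime_params resolver and made-up table/credential names; the timestamp would be passed at run time, e.g. kedro run --params run_ts=20240501:

```yaml
clusters:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_credentials

clusters_with_timestamp:
  type: snowflake.SnowparkTableDataset
  # resolves to e.g. CLUSTERS_20240501 at run time
  table_name: CLUSTERS_${runtime_params:run_ts}
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_credentials
```

The node then reads clusters and writes clusters_with_timestamp, and the next iteration can point its input at whichever timestamped table you want to compare against.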

You cannot have the same input and output datasets, but two differently named data catalog entries can point to the same underlying resource (file, database table, etc.).
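
For instance, a hedged sketch of that workaround (illustrative names; save_args assumed to be forwarded to Snowpark's save_as_table):

```yaml
clusters_input:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_credentials

clusters_output:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_credentials
  save_args:
    mode: append  # keep earlier runs; add a timestamp column per run to track migrations
```

The node takes clusters_input and returns clusters_output; Kedro sees two distinct catalog entries, so the same-input-and-output constraint is satisfied even though both map to the same table.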
