
Default memory dataset copy method prioritizes accuracy over efficiency

I have a question about the memory dataset's default copy method. I noticed that if the data is a pandas DataFrame or a NumPy array, a copy rather than an assignment (i.e. making a reference) is used by default. I'm wondering what the rationale for that is. Making a reference is often cheaper in runtime than making either a shallow or deep copy. Why is assignment not the top-priority default?

https://docs.kedro.org/en/stable/_modules/kedro/io/memory_dataset.html#MemoryDataset

8 comments

The goal of the default is to preserve the same behavior whether somebody uses a MemoryDataset or, say, a pandas.ParquetDataset. It would be confusing if your pipeline started behaving differently based on how you configured your catalog.

It does make sense to me to have the same default behaviour where possible, but I think I am missing some premises to fully understand the default in MemoryDataset. Why isn't assignment the default regardless of dataset type?

Without a copy, pandas assignments can be unsafe: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy

This can't really happen with Spark, Polars, Ibis, etc.
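To make the hazard concrete, here is a minimal sketch in plain pandas, outside Kedro (the column name and function are hypothetical). It shows what can go wrong under `copy_mode: assign` when a node mutates its input in place, and why `copy` is the safer default for pandas:

```python
import pandas as pd

# A node that mutates its input in place (an anti-pattern, but it happens).
def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    df["x"] = df["x"].fillna(0)
    return df

# "assign" mode: the node receives the very same object,
# so the stored dataset is silently modified too.
stored = pd.DataFrame({"x": [1.0, None, 3.0]})
fill_missing(stored)           # passes a reference
print(stored["x"].isna().sum())  # 0 -- the stored data lost its NaN

# "copy" mode (the MemoryDataset default for pandas): the node
# works on a shallow copy, and the stored data stays untouched.
stored2 = pd.DataFrame({"x": [1.0, None, 3.0]})
fill_missing(stored2.copy())   # passes a copy
print(stored2["x"].isna().sum())  # 1 -- the NaN is still there
```

With Spark, Polars, or Ibis, the analogous operations return new objects rather than mutating in place, which is why a plain reference is safe there.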

Ah, thank you! I had forgotten about Pandas' approach. Thank you for sharing that documentation.

I have a follow-up question if you have time. To control this behaviour of assignment/copy/deep copy in my Kedro project, what is the conventional way to do that? Should I make a Kedro catalog entry with MemoryDataset as the dataset type?

Yep, sounds good!

That said, Kedro explicitly tries to separate data transformation logic from I/O. You should probably document it clearly if you want to do this, so that somebody doesn't come along later, swap in a different dataset, and find that things behave weirdly.
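Concretely, a catalog entry along these lines would pin the copy mode for a single dataset (the dataset name here is hypothetical):

```yaml
my_intermediate_table:
  type: kedro.io.MemoryDataset
  copy_mode: assign  # or "copy" / "deepcopy"
```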

Great! Thank you for answering my questions about this topic. 🙂

One additional reason + one comment:

  • Kedro pipelines used to be sorted non-deterministically, and a pandas DataFrame could be modified by different nodes. Running the same pipeline twice with the exact same configuration could lead to different results 🤯 The order is now deterministic, but the reason is still valid.
  • You can change the default behaviour with a factory in your catalog:
"{default}":
  type: kedro.io.MemoryDataset
  copy_mode: assign
