Default memory dataset copy method prioritizes accuracy over efficiency

I have a question about MemoryDataset's default copy method. I noticed that if the data is a pandas DataFrame or a NumPy array, a copy rather than an assignment (i.e. making a reference) is used by default. I'm wondering what the rationale for that is. Making a reference is often cheaper in runtime than making either a shallow or a deep copy, so why isn't assignment the default?

https://docs.kedro.org/en/stable/_modules/kedro/io/memory_dataset.html#MemoryDataset
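To make the behaviour I'm asking about concrete, here is a minimal sketch (assuming kedro.io.MemoryDataset and its copy_mode argument, as in the linked docs):

import pandas as pd
from kedro.io import MemoryDataset

df = pd.DataFrame({"x": [1, 2, 3]})

# Default: for pandas/NumPy data, save() and load() copy the object
ds_copy = MemoryDataset()
ds_copy.save(df)
assert ds_copy.load() is not df  # an independent copy each time

# copy_mode="assign": the dataset just stores and returns the reference
ds_assign = MemoryDataset(copy_mode="assign")
ds_assign.save(df)
assert ds_assign.load() is df  # same object, no copying cost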

The goal of the default is to preserve the same behaviour whether somebody uses a MemoryDataset or, say, a pandas.ParquetDataset. It would be confusing if your pipeline started behaving differently based on how you configured your catalog.

It does make sense to me to have the same default behaviour where possible, but I think I am missing some premises to fully understand the default in MemoryDataset. Why isn't assignment the default regardless of dataset type?

Without a copy, pandas assignments can be unsafe: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy

This can't really happen with Spark, Polars, Ibis, etc.
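A quick sketch of the failure mode this guards against, in plain pandas (no Kedro needed; double_x is a hypothetical node function):

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

def double_x(data: pd.DataFrame) -> pd.DataFrame:
    data["x"] = data["x"] * 2  # mutates its input in place
    return data

# Assignment passes the same object around, so the mutation leaks
# back into what every other consumer of df will see:
double_x(df)
print(df["x"].tolist())   # [2, 4, 6] -- the "input" changed under us

# A copy on the way in keeps the original intact:
df2 = pd.DataFrame({"x": [1, 2, 3]})
double_x(df2.copy())      # roughly what the default copy mode does for pandas
print(df2["x"].tolist())  # [1, 2, 3]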

Ah, thank you! I had forgotten about Pandas' approach. Thank you for sharing that documentation.

I have a follow-up question if you have time. To control this behaviour of assignment/copy/deep copy in my Kedro project, what is the conventional way to do that? Should I make a Kedro catalog entry with MemoryDataset as the dataset type?

Yep, sounds good!

That said, Kedro explicitly tries to separate data transformation logic from I/O. You should probably document it clearly if you want to do this, so that somebody doesn't come along later, swap in a different dataset, and things behave weirdly.

Great! Thank you for answering my questions about this topic. 🙂

One additional reason + one comment:

  • Kedro pipelines used to be sorted non-deterministically, and a pandas DataFrame could be modified in place by different nodes. Running the same pipeline twice with the exact same configuration could therefore lead to different results 🤯 The order is now deterministic, but the reason is still valid.
  • You can change the default behaviour with a factory in your catalog:
    "{default}":
      type: MemoryDataset
      copy_mode: assign
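(Note that such a catch-all pattern applies copy_mode: assign to every dataset not declared explicitly in the catalog, so the in-place mutation caveat discussed above then holds across the whole pipeline.)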
