I have a question about the memory dataset's default copy method. I noticed that if the data is a pandas DataFrame or a NumPy array, a copy rather than an assignment (i.e. making a reference) is used by default. I'm wondering what the rationale for that is. Making a reference is often cheaper in runtime than making either a shallow or deep copy. Why is assignment not the default?
https://docs.kedro.org/en/stable/_modules/kedro/io/memory_dataset.html#MemoryDataset
The goal of the default is to preserve the same behavior whether somebody uses a MemoryDataset or, say, pandas.ParquetDataset. It would be confusing if your pipeline started behaving differently based on how you configured your catalog.
It does make sense to me to have the same default behaviour where possible, but I think I am missing some premises to fully understand the default in MemoryDataset. Why isn't assignment the default regardless of dataset type?
Without a copy, pandas assignments can be unsafe: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy
This can't really happen with Spark, Polars, Ibis, etc.
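To illustrate the hazard the pandas docs describe, here is a minimal sketch (plain pandas, no Kedro) of why passing a DataFrame by reference lets downstream code silently mutate the upstream object, while a copy isolates it:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# "assign" semantics: downstream code receives the very same object,
# so an in-place edit mutates the original DataFrame too.
ref = df
ref["a"] = 0
assert df["a"].tolist() == [0, 0, 0]

# "copy" semantics (the MemoryDataset default for DataFrames):
# the mutation stays local to the copy.
df2 = pd.DataFrame({"a": [1, 2, 3]})
cp = df2.copy()
cp["a"] = 0
assert df2["a"].tolist() == [1, 2, 3]
```

With Spark, Polars, or Ibis this class of bug is much harder to hit, since their frames are (effectively) immutable.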
Ah, thank you! I had forgotten about Pandas' approach. Thank you for sharing that documentation.
I have a follow-up question if you have time. To control this assignment/copy/deep-copy behaviour in my Kedro project, what is the conventional way to do it? Should I make a Kedro catalog entry with MemoryDataset as the dataset type?
Yep, sounds good!
That said, Kedro explicitly tries to separate data transformation logic from I/O. If you do this, you should document it clearly, so that somebody doesn't come along later, swap in a different dataset, and find that things behave weirdly.
One additional reason, plus a comment:
```yaml
"{default}":
  type: MemoryDataset
  copy_mode: assign
```