Hey guys,
Me again 🙂 I had a question regarding the parquet dataset itself. I often run into issues with custom datatypes during saving. For instance, if I have a custom class in my dataframe, I would like to keep it as is (the reason why I use parquet). I know custom serializer/deserializer code is required to do this. I can for sure do it in my own code, but since it's IO-related, I believe it should be done in the dataset definition, where I can somehow point to my custom class, which gets serialized before writing to file. I will work on the extended version now; I was wondering if this has been discussed before? I'm happy to push this as a PR later.
If I summarised this right, the problem is that you need some class which is not part of the usual schema that can be defined in the YAML format?
Could you share the constructors of your current class? Maybe using an omegaconf resolver can already solve this? How is the custom serialiser registered?
So basically what I imagine is this:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
```
This new serialize_cls points to a custom data class that I create and that lives in my dataframe. During parquet save, I would invoke methods of my class to be able to write and read my parquet dataset.
The user can then define, say, _serialize_ and _deserialize_ functions which get invoked pre/post the load and save calls.
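To make the idea concrete, something like the sketch below could work. This is only an illustration under my assumptions: SerializableParquetDataset is a made-up name, and the _serialize_ / _deserialize_ methods are the user-defined hooks mentioned above, not an existing API.

```python
# Rough sketch only: a ParquetDataset subclass that converts a custom type
# before save and restores it after load. All names here are illustrative.
import pandas as pd
from kedro.utils import load_obj
from kedro_datasets.pandas import ParquetDataset


class SerializableParquetDataset(ParquetDataset):
    def __init__(self, *, serialize_cls=None, **kwargs):
        super().__init__(**kwargs)
        # e.g. "projx.pipelines.MyClass" from the catalog entry above
        self._serialize_cls = load_obj(serialize_cls) if serialize_cls else None

    def _save(self, data: pd.DataFrame) -> None:
        if self._serialize_cls is not None:
            # hypothetical user-defined method turning the object into an
            # Arrow-friendly value (e.g. a JSON string)
            data = data.applymap(
                lambda v: v._serialize_() if isinstance(v, self._serialize_cls) else v
            )
        super()._save(data)

    def _load(self) -> pd.DataFrame:
        data = super()._load()
        if self._serialize_cls is not None:
            # hypothetical user-defined classmethod rebuilding the object
            data = data.applymap(
                lambda v: self._serialize_cls._deserialize_(v)
                if isinstance(v, str)
                else v
            )
        return data
```

The catalog `type` would then point at this subclass instead of pandas.ParquetDataset.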
I am just wondering whether there is a need from other people to add this into Kedro; otherwise I can for sure implement a custom solution locally.
or possibly with omegaconf:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ...    # omegaconf python code
  deserializer: ...  # omegaconf python code
```
My question would be: how would you do that with pure Python code, without Kedro? How does pandas currently support this?
https://stackoverflow.com/questions/61271295/how-to-save-a-pandas-dataframe-with-custom-types-using-pyarrow-and-parquet
From this thread it seems like if you have the serialisation method implemented properly, you don't need anything extra?
I definitely have to provide a serialization method; the solution is the same with or without Kedro. I am just porting this support to the Kedro reader/writer because I don't want to do data conversion in my node, since that isn't part of what the node function is supposed to do.
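For reference, without Kedro it's just a conversion step wrapped around pd.to_parquet / pd.read_parquet. A minimal sketch, assuming a hypothetical MyClass that round-trips through JSON strings:

```python
# Plain pandas version of the same idea, no Kedro involved.
# MyClass and its to_json/from_json methods are made up for illustration.
import json

import pandas as pd


class MyClass:
    def __init__(self, value):
        self.value = value

    def to_json(self) -> str:
        return json.dumps({"value": self.value})

    @classmethod
    def from_json(cls, raw: str) -> "MyClass":
        return cls(json.loads(raw)["value"])


df = pd.DataFrame({"obj": [MyClass(1), MyClass(2)]})

# serialize the custom column to strings before writing
df.assign(obj=df["obj"].map(MyClass.to_json)).to_parquet("example.parquet")

# read back and rebuild the objects
restored = pd.read_parquet("example.parquet")
restored["obj"] = restored["obj"].map(MyClass.from_json)
```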
Let us know if it works; if there is a need to extend the current class, feel free to open an issue and a PR for this.
But I would say serialiser sounds like it belongs in save_args and deserialiser in load_args. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is pd.read_parquet and pd.to_parquet.
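For example, the existing arguments already flow through roughly like this (the filepath and argument values below are just placeholders):

```python
# Sketch of how the current arguments map onto the pandas API.
from kedro_datasets.pandas import ParquetDataset

dataset = ParquetDataset(
    filepath="data/01_raw/example.parquet",   # placeholder path
    load_args={"columns": ["a", "b"]},        # passed through to pd.read_parquet
    save_args={"compression": "snappy"},      # passed through to DataFrame.to_parquet
)
```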
Yes, you are right, but pandas with the new pyarrow dtype support is somehow super complicated and requires deeper knowledge of the underlying pandas code. I have a working version now, but now that I think about it (see pic), this can be done with hooks. All I have to do is check for the parquet dataset and apply serialization before read/write 🙂
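Roughly, the hook version could look like the sketch below; the column name and the MyClass methods are placeholders on my side, not the actual code from the pic:

```python
# Rough sketch of the hooks idea, not a confirmed implementation.
import pandas as pd
from kedro.framework.hooks import hook_impl

from projx.pipelines import MyClass  # the custom type from the catalog example


class ParquetSerializationHooks:
    SERIALIZED_DATASETS = {"data"}  # parquet datasets that carry MyClass values

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data) -> None:
        if dataset_name in self.SERIALIZED_DATASETS and isinstance(data, pd.DataFrame):
            # hooks don't replace the object being saved, so mutate it in place
            data["my_column"] = data["my_column"].map(
                lambda v: v._serialize_() if isinstance(v, MyClass) else v
            )

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data) -> None:
        if dataset_name in self.SERIALIZED_DATASETS and isinstance(data, pd.DataFrame):
            # rebuild the custom objects before the node sees the dataframe
            data["my_column"] = data["my_column"].map(MyClass._deserialize_)
```

The class would then be registered via HOOKS = (ParquetSerializationHooks(),) in the project's settings.py.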
Ah, I see what you are doing here. I think this is a smart way of handling it. It would be good if you could open an issue; I think this will be useful for other people who may have the same problem regardless.
I'll open an issue and share this there, but I'm not sure if more is required. Hooks are just awesome, sometimes I forget about them 🙂