Custom datatypes in parquet dataset causing issues

Hey guys,

Me again 😄 I had a question regarding the parquet dataset itself. I often encounter issues with custom datatypes during saving. For instance, if I have a custom class in my dataframe, I would like to keep it as is (that's the reason I use parquet in the first place). I know custom serializer/deserializer code is required to do this. I could do it in my own code, but since it's IO related, I believe it belongs in the dataset definition, where I can somehow point to my custom class so it gets serialized before writing to file. I will work on the extended version now; I was wondering if this has been discussed before? I'm happy to push this as a PR later.

If I've summarised this right, the problem is that you need some class which isn't part of the usual schema that can be defined in the YAML format?

Could you share the constructors of your current class? Maybe using an omegaconf resolver can already solve this. How is the custom serialiser registered?

So basically what I imagine is this:

data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
This new serialize_cls points to a custom data class that I create and that lives in my dataframe. During parquet save and load, the dataset would invoke methods of my class to be able to write and read my data.

Right now, only default data types work in parquet, so anything custom doesn't get serialized. Hence the solution.

The user can then define, let's say, _serialize_ and _deserialize_ functions which get invoked around the load and save calls.
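
For illustration, a rough sketch of how that could look as an extended dataset (everything here is hypothetical, not existing kedro-datasets API; serialize_cls, _serialize_ and _deserialize_ are just the names from the messages above, and depending on the kedro-datasets version the methods to override may be _load/_save or load/save):

from importlib import import_module

import pandas as pd
from kedro_datasets.pandas import ParquetDataset


class SerializingParquetDataset(ParquetDataset):
    """Hypothetical extension: run a user-supplied (de)serializer around parquet IO."""

    def __init__(self, *, serialize_cls: str, **kwargs):
        super().__init__(**kwargs)
        # Resolve the dotted path from the catalog into the actual class.
        module, _, name = serialize_cls.rpartition(".")
        self._serialize_cls = getattr(import_module(module), name)

    def _save(self, data: pd.DataFrame) -> None:
        # Turn custom-typed columns into parquet-friendly values before writing.
        super()._save(self._serialize_cls._serialize_(data))

    def _load(self) -> pd.DataFrame:
        # Rebuild the custom objects after the plain parquet read.
        return self._serialize_cls._deserialize_(super()._load())

The catalog entry would then point type: at this class instead of pandas.ParquetDataset.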

I am just thinking whether there is a need from other people to add this into Kedro; otherwise I can for sure implement a custom solution locally.

or possibly with omegaconf:

data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ... # omegaconf python code
  deserializer: ... # omegaconf python code
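
As a sketch of the resolver side (assuming a Kedro version where OmegaConfigLoader accepts custom resolvers via CONFIG_LOADER_ARGS; the obj name and the serializer/deserializer keys above are still hypothetical, and the dataset would need to accept them):

# settings.py
from importlib import import_module

from kedro.config import OmegaConfigLoader


def _load_object(path: str):
    """Turn a dotted path such as 'projx.pipelines.my_serializer' into the object it names."""
    module, _, name = path.rpartition(".")
    return getattr(import_module(module), name)


CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {"custom_resolvers": {"obj": _load_object}}

The catalog could then reference serializer: ${obj:projx.pipelines.my_serializer}.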

My question would be: how would you do that with pure Python code, without Kedro? How does pandas currently support this?

https://stackoverflow.com/questions/61271295/how-to-save-a-pandas-dataframe-with-custom-types-using-pyarrow-and-parquet

From this thread it seems like, if you have the serialisation method implemented properly, you don't need anything extra?
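
For reference, the Kedro-free round trip boils down to something like this (Point and the JSON encoding are just an illustrative choice of serialisation):

import json
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class Point:
    x: float
    y: float


df = pd.DataFrame({"id": [1, 2], "location": [Point(0.0, 1.0), Point(2.5, 3.5)]})

# Serialize: object column -> JSON strings that parquet can store natively.
df["location"] = df["location"].map(lambda p: json.dumps(asdict(p)))
df.to_parquet("points.parquet")

# Deserialize: read back and rebuild the custom objects.
restored = pd.read_parquet("points.parquet")
restored["location"] = restored["location"].map(lambda s: Point(**json.loads(s)))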

I definitely have to provide a serialization method; the solution is the same with or without Kedro. I am just porting this support to the Kedro reader/writer, as I don't want to do data conversion in my node, since that isn't part of what the node function is supposed to do.

I think an omegaconf resolver could work fine, I'm testing a solution atm.

Let us know if it works; if there is a need for extending the current class, feel free to open an issue and a PR for this.

But I would say the serialiser sounds like it belongs in save_args and the deserialiser in load_args. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is pd.read_parquet and pd.to_parquet.
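
i.e. roughly like this in the catalog, where the args are forwarded to the pandas calls (exact supported keys depend on the dataset version):

data:
  type: pandas.ParquetDataset
  filepath: ....
  load_args:   # passed through to pd.read_parquet
    engine: pyarrow
  save_args:   # passed through to pd.to_parquet
    engine: pyarrow
    index: false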

Yes, you are right, but pandas with the new pyarrow dtype support is somewhat complicated and requires deeper knowledge of the underlying pandas code. I have a working version now, but now that I think about it (see pic), this can be done with hooks. All I have to do is check for the parquet dataset and apply the serialization before read/write 😄

[Attachment: image.png]
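
The screenshot isn't reproduced here, but a minimal sketch of the kind of hook described above would be (assuming a Kedro version with the before_dataset_saved / after_dataset_loaded hook specs; MyClass, the dataset name and the column name are placeholders):

import json

import pandas as pd
from kedro.framework.hooks import hook_impl


class MyClass:
    """Placeholder for the custom type living in the dataframe column."""

    def __init__(self, value):
        self.value = value

    def to_json(self) -> str:
        return json.dumps({"value": self.value})

    @classmethod
    def from_json(cls, raw: str) -> "MyClass":
        return cls(**json.loads(raw))


class ParquetSerializationHooks:
    # Catalog entries backed by parquet whose dataframes carry the custom type.
    _PARQUET_DATASETS = {"data"}
    _COLUMN = "my_col"

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data) -> None:
        # Convert custom objects in place just before the dataset writes to parquet.
        if dataset_name in self._PARQUET_DATASETS and isinstance(data, pd.DataFrame):
            data[self._COLUMN] = data[self._COLUMN].map(MyClass.to_json)

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data) -> None:
        # Rebuild the custom objects in place just after the parquet read.
        if dataset_name in self._PARQUET_DATASETS and isinstance(data, pd.DataFrame):
            data[self._COLUMN] = data[self._COLUMN].map(MyClass.from_json)

Registered in settings.py via HOOKS = (ParquetSerializationHooks(),).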

Ah, I see what you are doing here. I think this is a smart way of handling it. It would be good if you could open an issue; I think this will be useful for other people who may run into the same problem.

yes so this works like a charm:

[Attachment: image.png]

I'll open an issue and share this there, but I'm not sure if more is required. Hooks are just awesome, sometimes I forget about them 😄

Maybe we can just put this into the documentation? That way people can discover it.
