Hey guys,
Me again 🙂 I had a question regarding the parquet dataset itself. I often run into issues with custom datatypes during saving. For instance, if I have a custom class in my dataframe, I would like to keep it as is (the reason why I use parquet). I know custom serializer/deserializer code is required to do this. I can for sure do it in my own code, but since it's IO-related, I believe it should be done in the dataset definition, where I can somehow point to my custom class, which gets serialized before writing to file. I will work on the extended version now; I was wondering if this has been discussed before? I'm happy to push this as a PR later.
If I summarised this right, the problem is that you need some class which is not part of the usual schema that can be defined in the YAML format?
Could you share the constructors of your current class? Maybe using an omegaconf resolver can already solve this? How is the custom serialiser registered?
So basically what I imagine is this:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
```
This new serialize_cls points to a custom data class that I create and that lives in my dataframe. During parquet save, I would invoke methods of my class to be able to write and read my parquet dataset.
The user can then define, say, _serialize_ and _deserialize_ functions which get invoked pre/post the load and save calls.
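To make the idea concrete, something like the sketch below could work. This is only an illustration under my assumptions: SerializableParquetDataset is a made-up name, and the _serialize_ / _deserialize_ methods are the user-defined hooks mentioned above, not an existing API.

```python
# Rough sketch only: a ParquetDataset subclass that converts a custom type
# before save and restores it after load. All names here are illustrative.
import pandas as pd
from kedro.utils import load_obj
from kedro_datasets.pandas import ParquetDataset


class SerializableParquetDataset(ParquetDataset):
    def __init__(self, *, serialize_cls=None, **kwargs):
        super().__init__(**kwargs)
        # e.g. "projx.pipelines.MyClass" from the catalog entry above
        self._serialize_cls = load_obj(serialize_cls) if serialize_cls else None

    def _save(self, data: pd.DataFrame) -> None:
        if self._serialize_cls is not None:
            # hypothetical user-defined method turning the object into an
            # Arrow-friendly value (e.g. a JSON string)
            data = data.applymap(
                lambda v: v._serialize_() if isinstance(v, self._serialize_cls) else v
            )
        super()._save(data)

    def _load(self) -> pd.DataFrame:
        data = super()._load()
        if self._serialize_cls is not None:
            # hypothetical user-defined classmethod rebuilding the object
            data = data.applymap(
                lambda v: self._serialize_cls._deserialize_(v)
                if isinstance(v, str)
                else v
            )
        return data
```

The catalog `type` would then point at this subclass instead of pandas.ParquetDataset.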
I am just wondering whether there is a need from other people to add this into Kedro; otherwise I can for sure implement a custom solution locally.
or possibly with omegaconf:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ...    # omegaconf python code
  deserializer: ...  # omegaconf python code
```
My question would be: how would you do that with pure Python code, without Kedro? How does pandas currently support this?
https://stackoverflow.com/questions/61271295/how-to-save-a-pandas-dataframe-with-custom-types-using-pyarrow-and-parquet
From this thread it seems like if you have the serialisation method implemented properly, you don't need anything extra?
I definitely have to provide a serialization method; the solution is the same with or without Kedro. I am just porting this support to the Kedro reader/writer because I don't want to do data conversion in my node, since that isn't part of what the node function is supposed to do.
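For reference, without Kedro it's just a conversion step wrapped around pd.to_parquet / pd.read_parquet. A minimal sketch, assuming a hypothetical MyClass that round-trips through JSON strings:

```python
# Plain pandas version of the same idea, no Kedro involved.
# MyClass and its to_json/from_json methods are made up for illustration.
import json

import pandas as pd


class MyClass:
    def __init__(self, value):
        self.value = value

    def to_json(self) -> str:
        return json.dumps({"value": self.value})

    @classmethod
    def from_json(cls, raw: str) -> "MyClass":
        return cls(json.loads(raw)["value"])


df = pd.DataFrame({"obj": [MyClass(1), MyClass(2)]})

# serialize the custom column to strings before writing
df.assign(obj=df["obj"].map(MyClass.to_json)).to_parquet("example.parquet")

# read back and rebuild the objects
restored = pd.read_parquet("example.parquet")
restored["obj"] = restored["obj"].map(MyClass.from_json)
```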
Let us know if it works; if there is a need to extend the current class, feel free to open an issue and a PR for this.
But I would say serialiser sounds like it belongs in save_args and deserialiser in load_args. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is pd.read_parquet and pd.to_parquet.
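For example, the existing arguments already flow through roughly like this (the filepath and argument values below are just placeholders):

```python
# Sketch of how the current arguments map onto the pandas API.
from kedro_datasets.pandas import ParquetDataset

dataset = ParquetDataset(
    filepath="data/01_raw/example.parquet",   # placeholder path
    load_args={"columns": ["a", "b"]},        # passed through to pd.read_parquet
    save_args={"compression": "snappy"},      # passed through to DataFrame.to_parquet
)
```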
Yes, you are right, but pandas with the new pyarrow dtype support is somehow super complicated and requires deeper knowledge of the underlying pandas code. I have a working version now, but now that I think about it (see pic), this can be done with hooks. All I have to do is check for the parquet dataset and apply serialization before read/write 🙂
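Roughly, the hook version could look like the sketch below; the column name and the MyClass methods are placeholders on my side, not the actual code from the pic:

```python
# Rough sketch of the hooks idea, not a confirmed implementation.
import pandas as pd
from kedro.framework.hooks import hook_impl

from projx.pipelines import MyClass  # the custom type from the catalog example


class ParquetSerializationHooks:
    SERIALIZED_DATASETS = {"data"}  # parquet datasets that carry MyClass values

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data) -> None:
        if dataset_name in self.SERIALIZED_DATASETS and isinstance(data, pd.DataFrame):
            # hooks don't replace the object being saved, so mutate it in place
            data["my_column"] = data["my_column"].map(
                lambda v: v._serialize_() if isinstance(v, MyClass) else v
            )

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data) -> None:
        if dataset_name in self.SERIALIZED_DATASETS and isinstance(data, pd.DataFrame):
            # rebuild the custom objects before the node sees the dataframe
            data["my_column"] = data["my_column"].map(MyClass._deserialize_)
```

The class would then be registered via HOOKS = (ParquetSerializationHooks(),) in the project's settings.py.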
Ah, I see what you are doing here. I think this is a smart way of handling it. It would be good if you could open an issue; I think this will be useful for other people who may have the same problem regardless.
I'll open an issue and share this there, but I'm not sure if more is required. Hooks are just awesome, sometimes I forget about them 🙂