
Partitioning Parquet Files with Kedro's PartitionedDataset

I'm trying to read a CSV file in chunks and then save the result as partitioned Parquet files. The following catalog entry raises a DatasetError:

"{company}.{layer}.transactions":
  type: pandas.ParquetDataset
  filepath: data/{company}/{layer}/transactions
  save_args:
    partition_cols: [year, month]    
The error:
DatasetError: ParquetDataset does not support save argument 'partition_cols'. Please use 'kedro.io.PartitionedDataset' instead.

How am I supposed to do this using PartitionedDataset, and what is the reason behind blocking the use of partition_cols in pandas.ParquetDataset? (I'm asking because I could just override it with a custom dataset.)

6 comments

Hi @Hugo Barreto, I am not exactly sure about the rationale for why partition_cols is not supported. Maybe @Nok or someone else has a better idea, as this has been around from the start. You can do this using PartitionedDataset as mentioned here, with dataset: pandas.ParquetDataset as the underlying dataset. Thank you
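For reference, a minimal catalog sketch of this approach, reusing the templated path from the question (the filename_suffix value and directory layout here are illustrative, not prescribed by the thread):

```yaml
# Hypothetical catalog entry: PartitionedDataset writes one Parquet file
# per partition, with pandas.ParquetDataset as the underlying dataset.
"{company}.{layer}.transactions":
  type: kedro.io.PartitionedDataset
  path: data/{company}/{layer}/transactions
  dataset: pandas.ParquetDataset
  filename_suffix: ".parquet"
```

The node feeding this dataset would return a dict mapping partition keys (e.g. "2024/01") to DataFrames; Kedro then writes one file per key instead of relying on partition_cols.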

Thanks for pointing that out

Please feel free to raise an issue/PR to fix this.

I am pretty sure this was mentioned recently (maybe in an internal conversation I saw elsewhere). This code is quite old, which is why the error is explicitly raised for that argument. AFAIK, in the old days partitioning with Parquet didn't work very well, which may have been related to the fastparquet engine. These days the default is pyarrow, and partitioning is supported by pandas out of the box.

TL;DR: I think it's an outdated guardrail that should be removed
