
Partitioning Parquet Files with Kedro's PartitionedDataset

I'm trying to read a CSV file in chunks and then save the result as partitioned Parquet files. The following catalog entry raises a DatasetError:

"{company}.{layer}.transactions":
  type: pandas.ParquetDataset
  filepath: data/{company}/{layer}/transactions
  save_args:
    partition_cols: [year, month]    
The error:
DatasetError: ParquetDataset does not support save argument 'partition_cols'. Please use 'kedro.io.PartitionedDataset' instead.

How am I supposed to do this using PartitionedDataset, and what is the reason behind blocking the use of partition_cols in pandas.ParquetDataset? (I'm asking because I could just override it with a custom dataset.)

6 comments

Hi @Hugo Barreto, I am not exactly sure about the rationale for why partition_cols is not supported. Maybe @Nok or someone else has a better idea, as this has been around from the start. You can do this using PartitionedDataset as mentioned here, with dataset: pandas.ParquetDataset as the underlying dataset. Thank you
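For reference, a minimal catalog sketch of this approach, reusing the templated path from the question (the filename_suffix value and directory layout here are illustrative, not prescribed by the thread):

```yaml
# Hypothetical catalog entry: PartitionedDataset writes one Parquet file
# per partition, with pandas.ParquetDataset as the underlying dataset.
"{company}.{layer}.transactions":
  type: kedro.io.PartitionedDataset
  path: data/{company}/{layer}/transactions
  dataset: pandas.ParquetDataset
  filename_suffix: ".parquet"
```

The node feeding this dataset would return a dict mapping partition keys (e.g. "2024/01") to DataFrames; Kedro then writes one file per key instead of relying on partition_cols.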

Thanks for pointing that out

Please feel free to raise an issue/PR to fix this.

I am pretty sure this was mentioned recently (maybe in an internal conversation I saw elsewhere). This code is quite old, which is why the error is explicitly raised for that argument. AFAIK, in the old days partitioning with Parquet didn't work very well, which may have been related to the fastparquet engine. These days the default is pyarrow, and partitioning is supported by pandas out of the box.

TL;DR: I think it's an outdated guardrail that should be removed
