I'm trying to read a CSV file in chunks and then save the result as partitioned Parquet files. The following catalog entry raises a DatasetError:

    "{company}.{layer}.transactions":
      type: pandas.ParquetDataset
      filepath: data/{company}/{layer}/transactions
      save_args:
        partition_cols: [year, month]

The error:

    DatasetError: ParquetDataset does not support save argument 'partition_cols'. Please use 'kedro.io.PartitionedDataset' instead.
Hi @Hugo Barreto, I am not exactly sure of the rationale for why partition_cols is not supported. Maybe @Nok or someone else has a better idea, as this has been around from the start. You can do this using PartitionedDataset as mentioned here, with dataset: pandas.ParquetDataset as the underlying dataset. Thank you
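A minimal sketch of what that catalog entry might look like with PartitionedDataset. This is an assumption, not a verified config: the exact type path (e.g. partitions.PartitionedDataset vs kedro.io.PartitionedDataset) depends on your Kedro version, and the partition-key layout is illustrative:

```yaml
# Hypothetical catalog entry -- exact dataset type path depends on Kedro version
"{company}.{layer}.transactions":
  type: partitions.PartitionedDataset
  path: data/{company}/{layer}/transactions
  dataset:
    type: pandas.ParquetDataset
  filename_suffix: ".parquet"
```

With this setup the node returns a dict mapping partition keys to DataFrames (each key becomes a file path under `path`), e.g. `{"year=2024/month=1": chunk_df, ...}`, so you decide the partitioning scheme yourself rather than passing partition_cols.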
I am pretty sure this was mentioned recently (maybe it's an internal conversation I saw elsewhere). This code is quite old, which is why the error is raised explicitly for that argument. AFAIK, in the old days partitioning with Parquet didn't work very well, which may be related to the fastparquet engine. These days the default is pyarrow, and partitioning is supported by pandas out of the box.