Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:
"{namespace}.spreadsheet_data": type: pandas.ExcelDataset filepath: data/03_primary/all_data_sources.xlsx save_args: sheet_name: "{namespace}_data"but this doesn't work. I just end up with a spreadsheet with one sheet - with the name of whatever namespace ran last. I.e. it must be overwriting it each time.
Hi Minesh, Thank you for raising the issue. Did you try using the append mode for the Excel Writer. I also see an option of if_sheet_exists
which might help.
"{namespace}.spreadsheet_data": type: pandas.ExcelDataset filepath: data/03_primary/all_data_sources.xlsx save_args: writer: mode: a if_sheet_exists: new sheet_name: "{namespace}_data"Something like above ?
Like mentioned, you can pass kwargs
to the underlying load and save functions with save_args
and load_args
in the Data Catalog configuration. There’s a few examples on this page - https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html
Thank you both for you suggestions. I too thought it would be fairly straight forward. Especially given how easy it is to accomplish with just pd.ExcelWriter. However there were a couple of snags - "append" mode only works for spreadsheets that already exist. As far as I can see there is no "append if exists, but create first if it doesn't exist" option.
But even when I manually created the spreadsheet first I would get an "file is not a zip file" error which is some sort of openpyxl error (could be package version issue). In any case, I just got round the issue by created an custom dataset and just using pd.ExcelWriter in the save method there.
Thank you both for you help.