Join the Kedro community

Updated 3 days ago

Saving Multiple Dataframes to Different Sheets in an Excel Spreadsheet

Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:

"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    sheet_name: "{namespace}_data"
but this doesn't work. I just end up with a spreadsheet with one sheet - with the name of whatever namespace ran last. I.e. it must be overwriting it each time.

I have read that I will need to specify a writer if I want to write to a file that already exists (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html) but I can't get that to work.
Is what I would like to do possible?
Many thanks

R
A
m
4 comments

Hi Minesh, Thank you for raising the issue. Did you try using the append mode for the Excel Writer. I also see an option of if_sheet_exists which might help.

"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    writer:
      mode: a
      if_sheet_exists: new
    sheet_name: "{namespace}_data"
Something like above ?

Like mentioned, you can pass kwargs to the underlying load and save functions with save_args and load_args in the Data Catalog configuration. There’s a few examples on this page - https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html

Thank you both for you suggestions. I too thought it would be fairly straight forward. Especially given how easy it is to accomplish with just pd.ExcelWriter. However there were a couple of snags - "append" mode only works for spreadsheets that already exist. As far as I can see there is no "append if exists, but create first if it doesn't exist" option.

But even when I manually created the spreadsheet first I would get an "file is not a zip file" error which is some sort of openpyxl error (could be package version issue). In any case, I just got round the issue by created an custom dataset and just using pd.ExcelWriter in the save method there.

Thank you both for you help.

Add a reply
Sign up and join the conversation on Slack