Appending Rows to a CSV File with the DataCatalog

Hey guys, I'm having trouble appending to a CSV with the DataCatalog. My node returns a DataFrame with one row and multiple metric names as columns. It writes results.csv to the folder accordingly, but it doesn't append the rows. In addition, a blank row is created after the first row (might that indicate the flaw?). When I debug step by step, both DataFrames get written to the CSV, but they overwrite each other.
Metric | Seed
--------|-------
1.0 | 42

results.update(
    {
        "seed": seed,
    }
)
# One row per run: metric names and the seed become columns
return pd.DataFrame.from_dict([results])
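
For context, the node is roughly shaped like this (a simplified sketch; the function name and the accuracy metric are just placeholders for illustration):

import pandas as pd

def evaluate(seed: int) -> pd.DataFrame:
    # Placeholder metric; the real node computes several metric columns.
    results = {"accuracy": 1.0}
    results.update(
        {
            "seed": seed,
        }
    )
    # One row per run, with the metric names and the seed as columns.
    return pd.DataFrame.from_dict([results])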

My catalog has the save_args mode set to "a":
"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.

7 comments

I am not a member of the Kedro team (don't know if it's OK if I respond).

I am not sure, but maybe it has something to do with the default value of the header parameter of the to_csv method (it is True by default). So maybe adding header: False to the save_args will help.
In this case I don't know how to preserve the header row in the resulting CSV (because I assume that if you just set header to False, you will end up with a CSV without a header row).

Also, maybe it has something to do with the way CSVDataset opens the file. Here is a snippet from the docs:

with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
    data.to_csv(path_or_buf=fs_file, **self._save_args)
So maybe you would want to add mode: "a" to the open_args_save dictionary in your YAML config as well.

And the blank lines might be related to a wrong lineterminator value. Try adding lineterminator: "\n" to the save_args.
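
Putting both suggestions together, the save_args would look something like this (just a sketch; whether header: false is right for you depends on how you want to handle the header row):

"{engine}.{variant}.results":
  type: pandas.CSVDataset
  filepath: data/08_reporting/{engine}/results.csv
  save_args:
    mode: "a"             # append instead of overwrite
    header: false         # don't re-write the header on every save
    lineterminator: "\n"  # avoid the extra blank row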

Thank you very much for your input. The lineterminator did in fact remove the blank row.
Unfortunately, adding load_args: mode: "a" did not do the trick. I tried both approaches with header = true|false.

My current workaround is a hook which opens/creates the file and concatenates to it. But it feels kind of hacky, especially because the CSVDataset has an argument for appending.
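
Roughly, the hook looks like this (a simplified sketch; the node name, output handling and file path are placeholders, and it bypasses the catalog entirely):

from pathlib import Path

import pandas as pd
from kedro.framework.hooks import hook_impl


class AppendResultsHook:
    @hook_impl
    def after_node_run(self, node, outputs):
        # "evaluate_node" and the target path are placeholders for illustration.
        if node.name != "evaluate_node":
            return
        path = Path("data/08_reporting/results.csv")
        for value in outputs.values():
            if isinstance(value, pd.DataFrame):
                # Write the header only when the file does not exist yet.
                value.to_csv(path, mode="a", header=not path.exists(), index=False)

It gets registered via the HOOKS tuple in settings.py.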

Unfortunately, adding load_args: mode: "a" did not do the trick.
Maybe it's just a typo, but just to make sure: I was suggesting to add open_args_save like this

"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
  open_args_save: <-- HERE
    mode: "a" # <-- AND HERE
Because, as you can see from the snippet I provided previously, pandas.CSVDataset first opens the file with some fs module, and by default it uses these args (you can find this in the sources):
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "w"}}
It then passes the opened file as a buffer (perhaps?) to the to_csv method. So it seems that the file might be overwritten not by pandas, but by the fs module.

Oh, sorry, I provided the wrong config. It should be like this:

"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
  fs_args: <-- I MISSED THIS ONE
    open_args_save: <-- HERE
      mode: "a" # <-- AND HERE
because the parameter name is fs_args, and from the documentation:
  • fs_args (Optional[dict[str, Any]]) – Extra arguments to pass into underlying filesystem class constructor (e.g. {“project”: “my-project”} for GCSFileSystem). Defaults are preserved, apart from the open_args_save mode which is set to w.

Thanks for your effort. It kind of works, with some limitations. What I found out so far is:

The file needs to already be present in the folder to be written to. Otherwise the following error occurs (even though the file does not actually exist):

kedro.io.core.DatasetError: Cannot save versioned dataset 'results.csv' to 'C:/___/data/08_reporting/tests' because a file with the same name already exists in the directory. This is likely because versioning was enabled on a dataset already saved previously. Either remove 'results.csv' from the directory or manually convert it into a versioned dataset by placing it in a versioned directory

This can be done manually, or by setting open_args_save: mode: "w", but with "w" the file won't append, so it has to be changed back again.
In addition, all column headers get appended as well if they are provided, so the CSV gets bloated with them.

I guess the easiest way would be to remove the headers from the data returned by the node and append to a manually placed CSV that already has the needed columns. Sadly, this approach isn't really flexible. Kind of curious if there is another way to populate the CSV with DataFrames containing headers.

It's strange that mode: "a" does not create a file if it does not exist.

Maybe as a workaround you can create a "utility" node and a dataset that creates an empty CSV with a header row before the actual node runs, so that by the time your actual node starts its work the file is already there.
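
Something along these lines (a sketch; the column list, path and the dataset wiring are placeholders):

from pathlib import Path

import pandas as pd


def seed_results_file(columns: list[str], path: str = "data/08_reporting/results.csv") -> bool:
    """Create an empty results.csv containing only the header row, if it is missing."""
    p = Path(path)
    if not p.exists():
        p.parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(columns=columns).to_csv(p, index=False)
    # Dummy output so the evaluation node can declare a dependency on this node
    # and Kedro runs it first.
    return True

Then make the actual node take the dummy output as an input, so Kedro orders the two nodes correctly.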

https://github.com/kedro-org/kedro-plugins/issues/513
I remember there was an issue with CSVDataset in particular, which seems to be fixed now by: https://github.com/kedro-org/kedro-plugins/pull/805

Cc @Merel
