Appending Rows to a CSV File with the DataCatalog

Hey guys, I'm having trouble appending to a CSV with the DataCatalog. My node returns a DataFrame with one row and multiple metric names as columns. It writes results.csv to the folder accordingly, but it doesn't append the rows. In addition, a blank row is created after the first row (might that indicate the flaw?). When I debug step by step, both DataFrames get written to the CSV, but they overwrite each other.
Metric | Seed
--------|-------
1.0 | 42

results.update(
    {
        "seed": seed,
    }
)
# One row per run: metric names and the seed become columns
return pd.DataFrame.from_dict([results])
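
For context, the node is roughly shaped like this (a simplified sketch; the function name and the accuracy metric are just placeholders for illustration):

import pandas as pd

def evaluate(seed: int) -> pd.DataFrame:
    # Placeholder metric; the real node computes several metric columns.
    results = {"accuracy": 1.0}
    results.update(
        {
            "seed": seed,
        }
    )
    # One row per run, with the metric names and the seed as columns.
    return pd.DataFrame.from_dict([results])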

My catalog has the save_args mode set to "a":
"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.

7 comments

I am not a member of the Kedro team (don't know if it's OK if I respond).

I am not sure, but maybe it has something to do with the default value of the header parameter of the to_csv method (it is True by default). So maybe adding header: False to the save_args will help.
In this case I don't know how to preserve the header row in the resulting CSV (because I assume that if you just set header to False, you will end up with a CSV without a header row).

Also, maybe it has something to do with the way CSVDataset opens the file. Here is a snippet from the docs:

with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
    data.to_csv(path_or_buf=fs_file, **self._save_args)
So maybe you would want to add mode: "a" to the open_args_save dictionary in your YAML config as well.

And the blank lines might be related to a wrong lineterminator value. Try adding lineterminator: "\n" to the save_args.
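
Putting both suggestions together, the save_args would look something like this (just a sketch; whether header: false is right for you depends on how you want to handle the header row):

"{engine}.{variant}.results":
  type: pandas.CSVDataset
  filepath: data/08_reporting/{engine}/results.csv
  save_args:
    mode: "a"             # append instead of overwrite
    header: false         # don't re-write the header on every save
    lineterminator: "\n"  # avoid the extra blank row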

Thank you very much for your input. The lineterminator did in fact remove the blank row.
Unfortunately, adding load_args: mode: "a" did not do the trick. I tried both approaches with header = true|false.

My current workaround is a hook which opens/creates the file and concatenates to it. But it feels kind of hacky, especially because the CSVDataset has an argument for appending.
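
Roughly, the hook looks like this (a simplified sketch; the node name, output handling and file path are placeholders, and it bypasses the catalog entirely):

from pathlib import Path

import pandas as pd
from kedro.framework.hooks import hook_impl


class AppendResultsHook:
    @hook_impl
    def after_node_run(self, node, outputs):
        # "evaluate_node" and the target path are placeholders for illustration.
        if node.name != "evaluate_node":
            return
        path = Path("data/08_reporting/results.csv")
        for value in outputs.values():
            if isinstance(value, pd.DataFrame):
                # Write the header only when the file does not exist yet.
                value.to_csv(path, mode="a", header=not path.exists(), index=False)

It gets registered via the HOOKS tuple in settings.py.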

Unfortunately, adding load_args: mode: "a" did not do the trick.
Maybe it's just a typo, but just to make sure: I was suggesting to add open_args_save like this

"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
  open_args_save: <-- HERE
    mode: "a" # <-- AND HERE
Because, as you can see from the snippet I provided previously, pandas.CSVDataset first opens the file with some fs module, and by default it uses these args (you can find this in the sources):
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "w"}}
It then passes the opened file as a buffer (perhaps?) to the to_csv method. So it seems that the file might be overwritten not by pandas, but by the fs module.

Oh, sorry, I provided the wrong config. It should be like this:

"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
  fs_args: <-- I MISSED THIS ONE
    open_args_save: <-- HERE
      mode: "a" # <-- AND HERE
because the parameter name is fs_args, and from the documentation:
  • fs_args (Optional[dict[str, Any]]) – Extra arguments to pass into underlying filesystem class constructor (e.g. {“project”: “my-project”} for GCSFileSystem). Defaults are preserved, apart from the open_args_save mode which is set to w.

Thanks for your effort. It kind of works, with some limitations. What I found out so far is:

The file needs to already be present in the folder to be written to. Otherwise the following error occurs (even though the file does not actually exist):

kedro.io.core.DatasetError: Cannot save versioned dataset 'results.csv' to 'C:/___/data/08_reporting/tests' because a file with the same name already exists in the directory. This is likely because versioning was enabled on a dataset already saved previously. Either remove 'results.csv' from the directory or manually convert it into a versioned dataset by placing it in a versioned directory

This can be done manually, or by setting open_args_save: mode: "w", but with "w" the file won't append, so it has to be changed back again.
In addition, all column headers get appended as well if they are provided, so the CSV gets bloated with them.

I guess the easiest way would be to remove the headers from the data returned by the node and append to a manually placed CSV that already has the needed columns. Sadly, this approach isn't really flexible. Kind of curious if there is another way to populate the CSV with DataFrames containing headers.

It's strange that mode: "a" does not create a file if it does not exist.

Maybe as a workaround you can create a "utility" node and a dataset that creates an empty CSV with a header row before the actual node runs, so that by the time your actual node starts its work the file is already there.
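
Something along these lines (a sketch; the column list, path and the dataset wiring are placeholders):

from pathlib import Path

import pandas as pd


def seed_results_file(columns: list[str], path: str = "data/08_reporting/results.csv") -> bool:
    """Create an empty results.csv containing only the header row, if it is missing."""
    p = Path(path)
    if not p.exists():
        p.parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(columns=columns).to_csv(p, index=False)
    # Dummy output so the evaluation node can declare a dependency on this node
    # and Kedro runs it first.
    return True

Then make the actual node take the dummy output as an input, so Kedro orders the two nodes correctly.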

https://github.com/kedro-org/kedro-plugins/issues/513
I remember there was an issue with CSVDataset in particular, which seems to be fixed now by: https://github.com/kedro-org/kedro-plugins/pull/805

Cc @Merel
