Hey guys, I'm having trouble appending to a CSV with the Data Catalog. My node returns a DataFrame with one row and multiple metric names as columns. It writes results.csv to the folder as expected, but it doesn't append the rows. In addition, a blank row is created after the first row (might that indicate the flaw?). When I debug step by step, both DataFrames get written to the CSV, but they overwrite each other.
Metric | Seed
--------|-------
1.0 | 42
results.update({"seed": seed})
return pd.DataFrame.from_dict([results])

"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
I am not a member of the Kedro team (I don't know if it's OK for me to respond).
I am not sure, but maybe it has something to do with the default value of the header param of the to_csv method (it is True by default). So maybe adding header: False to the save_args will help.
In that case I don't know how to preserve the header row in the resulting CSV (because I assume that if you just set header to False you will end up with a CSV without a header row).
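One way to keep a single header row while appending is to decide per write whether the header is needed. The sketch below shows that idea with the standard library csv module only; append_row is a hypothetical helper, not something Kedro or pandas provides. With pandas, the same decision could presumably be fed into the header argument of to_csv.

```python
import csv
from pathlib import Path


def append_row(path: Path, row: dict) -> None:
    """Append one row; write the header only if the file is missing or empty."""
    write_header = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings itself.
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row), lineterminator="\n")
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling append_row twice on the same path should leave one header line followed by two data lines.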
Also, maybe it has something to do with the way CSVDataset opens the file. Here is a snippet from the docs:
with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
    data.to_csv(path_or_buf=fs_file, **self._save_args)
So maybe you would want to add mode: "a" also to the open_args_save dictionary in your YAML config.
As for the blank row, it may be related to the lineterminator value. Try adding lineterminator: "\n" to the save_args.
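A plausible reading of the blank row (an assumption, not confirmed in the thread) is that the writer emits "\r\n" while the text-mode file handle also turns every "\n" into "\r\n", as default text handles do on Windows. The sketch below simulates that translation on any platform with newline="\r\n" and uses plain writes instead of pandas; write_rows is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def write_rows(path: Path, lineterminator: str) -> bytes:
    """Write two CSV lines through a text handle that translates LF to CRLF."""
    # newline="\r\n" mimics what a default text-mode handle does on Windows.
    with path.open("w", newline="\r\n") as handle:
        for line in ("metric,seed", "1.0,42"):
            handle.write(line + lineterminator)
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.csv"
    doubled = write_rows(target, "\r\n")  # writer already emits CRLF
    clean = write_rows(target, "\n")      # writer emits plain LF

print(doubled)  # each line ends in CR CR LF, which many tools show as a blank row
print(clean)    # each line ends in a single CR LF
```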
Thank you very much for your input. The lineterminator did in fact remove the blank row.
Unfortunately, adding load_args: mode: "a" did not do the trick. I tried both approaches with header set to true and to false.
My current workaround is a hook which opens/creates the file and concatenates the new rows onto it. But it feels kind of hacky, especially because the CSVDataset has the argument for appending.
> Unfortunately adding load_args: mode: "a" did not do the trick.
Maybe it's just a typo, but just to make sure: I was suggesting to add open_args_save, like this:
"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
  open_args_save:  # <-- HERE
    mode: "a"  # <-- AND HERE
Because, as you can see from the snippet I provided previously, pandas.CSVDataset first opens the file with some fs module, and by default it uses these args (you can find this in the sources):
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "w"}}
It then passes the opened file as a buffer (perhaps?) to the to_csv method. So it seems that the file might be overwritten not by pandas, but by the fs module.
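The suspicion that the open mode of the file handle decides whether earlier rows survive can be checked without Kedro. This is a minimal sketch using the built-in open as a stand-in for the filesystem layer and a plain write as a stand-in for to_csv; save is a hypothetical function, not the CSVDataset implementation.

```python
import tempfile
from pathlib import Path


def save(path: Path, text: str, open_mode: str) -> None:
    """Stand-in for a dataset save: the open mode decides whether the file is truncated."""
    with path.open(open_mode) as handle:
        # Whatever writes into the handle cannot undo a truncation done by open().
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.csv"

    # Two saves with "w": the second run wipes the first.
    save(target, "metric,seed\n1.0,42\n", "w")
    save(target, "0.5,43\n", "w")
    overwritten = target.read_text()

    # Second save with "a": the first run is preserved.
    save(target, "metric,seed\n1.0,42\n", "w")
    save(target, "0.5,43\n", "a")
    appended = target.read_text()

print(repr(overwritten))
print(repr(appended))
```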
Oh, sorry, I provided the wrong config. It should be like this:
"{engine}.{variant}.results":
  type: pandas.CSVDataset  # Underlying dataset type (CSV).
  filepath: data/08_reporting/{engine}/results.csv  # Path to the CSV file.
  save_args:
    mode: "a"  # Append mode for saving the CSV file.
  fs_args:  # <-- I MISSED THIS ONE
    open_args_save:  # <-- HERE
      mode: "a"  # <-- AND HERE
because the parameter name is fs_args, and from the documentation:
fs_args (Optional[dict[str, Any]]) – Extra arguments to pass into underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem). Defaults are preserved, apart from the open_args_save mode which is set to w.
Thanks for your effort. It kind of works, with some limitations. What I found out so far is:
The file needs to already exist in the folder before it can be written to. Otherwise the following error will occur (even though the file doesn't exist):
kedro.io.core.DatasetError: Cannot save versioned dataset 'results.csv' to 'C:/___/data/08_reporting/tests' because a file with the same name already exists in the directory. This is likely because versioning was enabled on a dataset already saved previously. Either remove 'results.csv' from the directory or manually convert it into a versioned dataset by placing it in a versioned directory
To create the file, I have to set open_args_save: mode: "w", but with "w" the file won't be appended to, so it has to be changed back again afterwards.
It's strange that mode: "a" does not create a file if it does not exist.
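For comparison, this is how append mode behaves with the plain built-in open: a missing file is created on demand, but a missing parent directory is not. Whether a missing directory is what triggers the DatasetError above is only a guess and is not confirmed anywhere in this thread; the paths are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Parent directory exists: "a" creates the missing file.
    existing_dir_file = Path(tmp) / "results.csv"
    with existing_dir_file.open("a") as handle:
        handle.write("metric,seed\n")
    created = existing_dir_file.exists()

    # Parent directory is missing: "a" raises instead of creating it.
    missing_dir_file = Path(tmp) / "tests" / "results.csv"
    try:
        with missing_dir_file.open("a") as handle:
            handle.write("metric,seed\n")
        failed = False
    except FileNotFoundError:
        failed = True

print(created, failed)
```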
Maybe as a workaround you can create a "utility" node and a dataset that will create some empty csv with header row before the actual node will run so by the time your actual node starts its work the file will be already there.
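A rough sketch of what such a utility step could do, using only the standard library. ensure_csv_with_header is a hypothetical helper and the column names are placeholders; wiring it into a node and a catalog entry is left out.

```python
import csv
from pathlib import Path


def ensure_csv_with_header(path: Path, columns: list[str]) -> bool:
    """Create `path` with only a header row if it does not exist yet.

    Returns True when the file was created, False when it was left untouched.
    """
    if path.exists():
        return False
    # Create the reporting folder too, in case it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerow(columns)
    return True
```

With the header already in place, the appending dataset would then need header: False in its save_args so that the header is not repeated on every run.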
https://github.com/kedro-org/kedro-plugins/issues/513
I remember there was an issue with CSVDataset in particular, which seems to be fixed now with: https://github.com/kedro-org/kedro-plugins/pull/805
Cc @Merel