Hey everyone, I am trying to define the column dtypes of a CSV dataset because some columns contain IDs that Kedro interprets as floats, but should be interpreted as strings instead. Setting
load_args:
  dtype:
    user_id: str
save_args:
  dtype:
    user_id: str
Hi, if you are using pandas.CSVDataset, save_args does not support dtype. Do you get any error with the config above?
Because pandas doesn't validate the kwargs. If you think about it, it makes sense: you cannot save a type as str in CSV because fundamentally it doesn't have a type schema.
You should aim for strong type consistency if you can; only use CSV at the reporting layer if you need it.
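To illustrate the point about CSV having no type schema, here is a minimal sketch using plain pandas with an in-memory buffer. The column values are made up for illustration; it only shows that string IDs written to CSV come back as numbers when read with default inference.

```python
import io

import pandas as pd

# Hypothetical data: IDs stored as strings with leading zeros.
original = pd.DataFrame({"user_id": ["001", "002"], "score": [0.5, 0.7]})

# Write to an in-memory CSV. Only text is stored, no dtype information.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back with default type inference.
reloaded = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(reloaded["user_id"].tolist())
print(pd.api.types.is_integer_dtype(reloaded["user_id"]))
```

The text on disk still contains 001, but nothing in the file says it was a string, so the reader has to guess.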
Thank you, I agree. I was hoping that Kedro might have some inbuilt type-casting via the conf, so that I don't have to handle it in logic when using CSVs, but understood.
You can do that type cast with load_args, but you cannot on save because CSV is just text.
Consider using a format like Parquet, which is strongly typed and much more efficient than CSV. Using CSV in data pipelines is one of the most common sources of bugs.
What I don't fully understand is where this is going wrong for you. While I fully agree that you should avoid using CSVs, can you check what the type of the user_id column is upon load? dtype is supported there.
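A quick way to run the check on load suggested here, sketched with plain pandas rather than through Kedro. The CSV content is invented, and the assumption is that load_args are passed on to pandas.read_csv as keyword arguments; a missing ID is one common reason an integer-looking column is inferred as float.

```python
import io

import pandas as pd

# Hypothetical CSV with one missing user_id.
csv_text = "user_id,score\n1001,0.5\n,0.7\n1003,0.9\n"

# Default inference: the missing value makes the ID column a float column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Equivalent of load_args: dtype: user_id: str
typed = pd.read_csv(io.StringIO(csv_text), dtype={"user_id": str})

print(inferred.dtypes)
print(typed.dtypes)
print(typed["user_id"].tolist())
```

Printing dtypes right after loading shows whether the dtype setting is actually being applied.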
If you want to be more explicit on save, to avoid it being read back as a float, you can do things like set the quoting style.
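As a rough sketch of the quoting suggestion, assuming it refers to the quoting argument of pandas DataFrame.to_csv (which would go under save_args in the catalog). The data is made up. Whether a reader then treats quoted values as strings depends on the reader, so it is probably safest to keep the dtype setting on load as well.

```python
import csv

import pandas as pd

# Hypothetical data: string IDs next to a numeric column.
frame = pd.DataFrame({"user_id": ["001", "002"], "score": [0.5, 0.7]})

# QUOTE_NONNUMERIC wraps non-numeric fields in quotes when writing.
quoted_text = frame.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)

print(quoted_text)
```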
So somehow the problem does not persist anymore. I started getting an error for the save_args and can now load as expected. Unfortunately, I am not sure what caused the error initially.