Updated 3 months ago

CSV column dtypes not being set correctly

Hey everyone, I am trying to define the column dtypes of a CSV dataset because some columns contain IDs that Kedro interprets as floats, but should be interpreted as strings instead. Setting

load_args:
  dtype:
    user_id: str

save_args:
  dtype:
    user_id: str

does not seem to work for me. Appreciate your help!

9 comments

Hi, if you are using pandas.CSVDataset, save_args does not support dtype. Do you get any error with the config above?

I don't get any error with it. It just seems to be ignored.

Can you try casting the data before saving and use dtype only in load_args?

Because pandas doesn't validate the kwargs. If you think about it, it makes sense: you cannot save a column's type as str in a CSV, because the format fundamentally has no type schema.

You should aim for strong type consistency if you can, and only use CSV at the reporting layer if you need it.

Thank you, I agree. I was hoping that Kedro might have some inbuilt type-casting via the conf, so that I don't have to handle it in logic when using CSVs, but understood 👍

You can do that with load_args to cast types on load, but you cannot do it on save, because CSV is just text.
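To make the load/save asymmetry concrete, here is a minimal sketch in plain pandas (outside Kedro). The column names and values are made up for illustration. It casts before saving, shows that the type is lost again on a default read, and shows that `to_csv` rejects a `dtype` argument, since it has no such parameter.

```python
import io

import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"user_id": ["00123", "00456"], "score": [1.5, 3.0]})

# Cast in your own logic before saving; the CSV text itself carries no types.
df["user_id"] = df["user_id"].astype(str)
csv_text = df.to_csv(index=False)

# Reading back without dtype lets pandas infer a numeric type again.
reloaded_default = pd.read_csv(io.StringIO(csv_text))

# Reading back with dtype keeps the IDs as strings.
reloaded_typed = pd.read_csv(io.StringIO(csv_text), dtype={"user_id": str})

# to_csv has no dtype parameter, so passing one raises TypeError.
try:
    df.to_csv(index=False, dtype={"user_id": str})
    save_dtype_rejected = False
except TypeError:
    save_dtype_rejected = True

print(reloaded_default["user_id"].tolist())
print(reloaded_typed["user_id"].tolist())
print(save_dtype_rejected)
```

In a Kedro catalog, the same `dtype` mapping would go under `load_args` only.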

Consider using a format like Parquet, which is strongly typed and much more efficient than CSV. Using CSV in data pipelines is one of the most common sources of bugs.

What I don't fully understand is where this is going wrong for you. While I fully agree that you should avoid using CSVs, can you check what the type of the user_id column is upon load? dtype is supported there.
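A quick way to run that check, sketched in plain pandas with made-up data. It also shows one likely reason IDs end up as floats: a missing value in an otherwise numeric-looking column makes pandas infer float64.

```python
import io

import pandas as pd

# Hypothetical CSV content: IDs with leading zeros and one missing value.
raw = "user_id,score\n00123,1.5\n,2.0\n00456,3.0\n"

# Without dtype, pandas infers a numeric type; the missing value forces float64.
inferred = pd.read_csv(io.StringIO(raw))
print(inferred["user_id"].dtype)  # float64

# With dtype, the non-missing IDs stay as the original strings.
typed = pd.read_csv(io.StringIO(raw), dtype={"user_id": str})
print(typed["user_id"].dropna().tolist())  # ['00123', '00456']
```

If the second form still gives you floats, the `dtype` entry is probably not reaching the loader, for example because the column name in the config does not match the header.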

If you want to be more explicit on save, to avoid the column being read back as a float, you can do things like set the quoting style.
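As a rough sketch of the quoting idea, again in plain pandas with made-up data: `csv.QUOTE_NONNUMERIC` from the standard library makes `to_csv` wrap string values in quotes and leave numbers bare, so the file at least records which fields were text.

```python
import csv

import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"user_id": ["00123", "00456"], "score": [1.5, 3.0]})

# Quote every non-numeric field on write; numeric fields stay unquoted.
lines = df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC).splitlines()

for line in lines:
    print(line)
```

`csv.QUOTE_NONNUMERIC` is the integer 2, which is the value you would put under `save_args` in YAML. Quoting on its own is not a guarantee that the reader keeps strings, so keeping `dtype` in `load_args` is still the safer choice.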

So somehow the problem no longer persists. I started getting an error for the save_args and can now load as expected. Unfortunately, I am not sure what caused the problem initially.
